Every human driver learns to navigate the physical world through experience — thousands of hours of turns, stops, near-misses, and split-second reactions in conditions no driving instructor could script. We learn the physics intuitively: wet roads mean longer stops, steep hills change braking feel, a bus drifting into your lane demands an immediate correction.
The AI systems being built to understand driving — world models — need to learn these same things. But tell a state-of-the-art world model to turn left, and it drives straight. Instruct it to brake, and it maintains speed. The model isn't broken — it's faithfully reproducing the dominant pattern in its training data: constant-velocity forward driving.
## 900,000 videos proved it's a data problem, not a model problem
A result from DrivingDojo, presented at NeurIPS 2024, explains why. Most widely used academic driving datasets — nuScenes, ONCE, OpenDV-YouTube — were collected for perception benchmarks, not for learning action-conditioned dynamics. Their behavioral distribution is heavily skewed toward routine forward driving, while turns, evasive maneuvers, and complex multi-agent interactions appear only sparsely.
The DrivingDojo team, working from Meituan's autonomous delivery fleet, collected 900,000 driving video clips — approximately 7,500 hours — across multiple Chinese cities with diverse maneuver distributions. Models trained on this data produced action-faithful generations. Vehicles turned when told to turn. The difference was entirely attributable to the training distribution.
Here is the kind of event that perception datasets systematically miss — a U-turn at nearly 50 mph on a rural Florida highway. No scripted collection route would include this maneuver. It exists in the data because a real driver on a real road decided to reverse course:

**Metadata**
| # | Latitude | Longitude | Altitude | Time |
|---|---|---|---|---|
| 1 | 26.450419 | -81.663473 | -14.2m | 5:25:33 PM |
| 2 | 26.450414 | -81.664212 | -14.3m | 5:25:37 PM |
| 3 | 26.450417 | -81.664604 | -14.2m | 5:25:41 PM |
| 4 | 26.450338 | -81.664548 | -14.3m | 5:25:45 PM |
| 5 | 26.450381 | -81.664154 | -14.4m | 5:25:49 PM |
| 6 | 26.450385 | -81.663508 | -14.4m | 5:25:52 PM |
| 7 | 26.450389 | -81.662706 | -14.4m | 5:25:56 PM |
| 8 | 26.450393 | -81.661788 | -14.3m | 5:26:00 PM |
**Try this API query:**

```shell
curl "https://beemaps.com/api/developer/aievents/699201bce834f725d2c17e35?includeGnssData=true&includeImuData=true" \
  -H "Authorization: Basic <your-api-key>"
```
## The perception-prediction mismatch
It helps to understand why the existing datasets don't work. The datasets the field has relied on for the past decade — nuScenes, Waymo Open, KITTI, Argoverse — were constructed for perception: 3D bounding boxes, semantic segmentation masks, point cloud labels. Their evaluation metrics — mAP, IoU, MOTA — all answer a single question: what objects are present in this scene, and where are they?
World models answer a categorically different question:
> Given the current scene state and an ego-action, what is the likely next state?
This is a conditional distribution over future video frames, not a classification over spatial labels. Training it requires temporal sequences with sufficient diversity in the conditioning variable — the ego-vehicle's action space.
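In symbols (our notation, not any particular paper's), the object being learned is the conditional

```latex
p_\theta\left(x_{t+1:t+H} \mid x_{1:t},\, a_t\right)
```

where $x$ are video frames, $a_t$ is the ego-action at time $t$, and $H$ is the prediction horizon. Action-conditioned generation only works if $a_t$ carries real information about the future frames in the training set.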
Why this happens: the model is trying to learn a relationship between actions and outcomes. But if 95% of the training data is "drive straight at constant speed," the model learns to ignore the action entirely. It doesn't matter what you tell it to do — the overwhelming pattern in the data is "keep going forward," so that's what it predicts.
The action conditioning becomes a no-op. DrivingDojo's empirical results are consistent with this interpretation — and it is what you would expect from the structure of the training data.
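The no-op failure mode is easy to see with toy numbers (illustrative, not DrivingDojo's actual statistics). When 95% of samples are forward driving, a model that ignores the action entirely is already 95% accurate:

```python
from collections import Counter

# Hypothetical skewed dataset: each sample pairs a commanded action
# with the motion actually observed in the clip. 95% forward driving.
dataset = (
    [("straight", "straight")] * 950
    + [("turn_left", "turn_left")] * 15
    + [("turn_right", "turn_right")] * 15
    + [("brake", "brake")] * 20
)

# A degenerate "model" that ignores the action and always predicts
# the most common motion in the training data.
mode_motion = Counter(motion for _, motion in dataset).most_common(1)[0][0]
baseline_acc = sum(motion == mode_motion for _, motion in dataset) / len(dataset)

# A model that actually reads the action conditioning.
conditioned_acc = sum(motion == action for action, motion in dataset) / len(dataset)

print(mode_motion, baseline_acc, conditioned_acc)  # straight 0.95 1.0
```

The gradient signal pushing the model to use the conditioning comes only from that last 5% of samples, which is why rebalancing the action distribution matters more than raw volume.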
Tu et al., in their February 2025 survey, arrive at a compatible conclusion: world models require "multi-sensor, multimodal operational architectures" that perception benchmarks were never designed to provide.
## Three gaps that more data alone won't close
The temptation is to say "we just need more data." But we think the problem is more specific than that. There are three distinct gaps that synthetic data, simulation, and internet-scraped video all fail to close.
### 1. Long-tail events
DrivingDojo-Open includes 3,700 video clips curated for rare scenarios:
- Adverse weather
- Foreign objects on the roadway
- Near-collisions
- Unusual traffic configurations
Including these clips measurably improved generation quality on out-of-distribution test scenarios.
The fundamental issue: long-tail events are definitionally rare, so any fixed-size collection effort will underrepresent them. The only way to capture them at scale is a large fleet continuously recording over long time horizons, paired with an event-detection system that identifies and indexes the high-information moments.
You cannot script a near-miss. You cannot plan for debris on the highway at kilometer 47 during a rainstorm. Coverage of these events is a function of fleet-hours and geographic breadth.
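The event-detection half of that system can start surprisingly small. Here is an illustrative sketch of a fleet-side detector for one event class, hard braking, flagged from the longitudinal-acceleration stream (the function name and thresholds are ours, not a production system):

```python
def detect_hard_braking(accel_long, hz=100, threshold=-3.0, min_duration_s=0.3):
    """Return (start, end) sample-index pairs where longitudinal acceleration
    stays at or below `threshold` (m/s^2) for at least `min_duration_s`."""
    min_samples = int(min_duration_s * hz)
    events, start = [], None
    for i, a in enumerate(accel_long):
        if a <= threshold:
            if start is None:
                start = i  # braking window opens
        else:
            if start is not None and i - start >= min_samples:
                events.append((start, i))  # sustained long enough to index
            start = None
    if start is not None and len(accel_long) - start >= min_samples:
        events.append((start, len(accel_long)))
    return events

# Synthetic 100 Hz trace: 1 s cruising, 0.5 s of -4 m/s^2 braking, 1 s cruising.
trace = [0.0] * 100 + [-4.0] * 50 + [0.0] * 100
print(detect_hard_braking(trace))  # [(100, 150)]
```

The hard part is not this logic; it is running something like it continuously across millions of fleet-hours and indexing the hits so they are retrievable.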
### 2. Geographic diversity
Liu et al. surveyed 265 autonomous driving datasets and documented persistent geographic concentration:
- KITTI — limited to German urban areas
- Waymo Open — Phoenix, San Francisco, and a few other American cities
- nuScenes — Boston and Singapore
- DrivingDojo — Chinese cities, delivery-vehicle perspective only
This matters more than it might seem at first. Driving behavior is a function of local norms, infrastructure design, and cultural convention. Lane discipline in Mumbai is structurally different from Munich. Roundabout negotiation varies between left-hand and right-hand driving countries.
A world model trained on one geographic distribution will likely fail on others — not because of a modeling limitation, but because the underlying behavior distribution is genuinely different.
### 3. Sensor grounding
This is perhaps the most underappreciated gap.
A 10-second video clip of a vehicle swerving is ambiguous. Was it an evasive maneuver or a lane change? What was the lateral acceleration? What was the velocity profile?
Without synchronized sensors, a model must infer physics from pixel changes. This is possible in principle, but extraordinarily sample-inefficient compared to providing ground truth directly:
- IMU stream → explicit supervision on forces
- 30Hz GNSS trace → explicit supervision on trajectories
- Calibrated speed → explicit supervision on velocity
These aren't auxiliary signals. For a world model learning physical dynamics, they may be the most important training signal available.
This event illustrates the point. A vehicle on the 101 freeway in Los Angeles braked hard at 65 mph — 1.4 m/s² deceleration — when a car cut across lanes ahead. Watch the video. It looks like typical LA traffic. Only the sensor data reveals this was an emergency stop:

**Metadata**
| # | Latitude | Longitude | Altitude | Time |
|---|---|---|---|---|
| 1 | 34.119326 | -118.337920 | 151.9m | 11:09:51 PM |
| 2 | 34.120291 | -118.338538 | 157.9m | 11:09:55 PM |
| 3 | 34.121219 | -118.339284 | 164.0m | 11:09:59 PM |
| 4 | 34.122091 | -118.340158 | 170.7m | 11:10:03 PM |
| 5 | 34.122886 | -118.341017 | 176.9m | 11:10:07 PM |
| 6 | 34.123061 | -118.341194 | 178.3m | 11:10:11 PM |
| 7 | 34.123331 | -118.341511 | 180.4m | 11:10:16 PM |
| 8 | 34.123750 | -118.342009 | 183.9m | 11:10:20 PM |
**Try this API query:**

```shell
curl "https://beemaps.com/api/developer/aievents/6976a302e13d2ed988573033?includeGnssData=true&includeImuData=true" \
  -H "Authorization: Basic <your-api-key>"
A video-only dataset would label this as routine freeway driving. But the deceleration profile — and the GNSS altitude climbing 30 meters over the clip as the vehicle ascends the Cahuenga Pass — tells a different story. For a world model learning braking physics on grades, this is exactly the kind of sample that matters.
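Both readings can be sanity-checked directly from GNSS fixes like the ones in the table above. A minimal sketch (spherical-earth distance, our helper names, not the Bee Maps SDK):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon fixes."""
    r = 6371000.0  # mean earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def segment_speeds_mps(fixes):
    """fixes: list of (lat, lon, alt_m, t_s). Per-segment ground speed, m/s."""
    return [
        haversine_m(a[0], a[1], b[0], b[1]) / (b[3] - a[3])
        for a, b in zip(fixes, fixes[1:])
    ]

def segment_grades(fixes):
    """Rise over run for each consecutive pair of fixes."""
    return [
        (b[2] - a[2]) / max(haversine_m(a[0], a[1], b[0], b[1]), 1e-9)
        for a, b in zip(fixes, fixes[1:])
    ]
```

Applied to the first two LA fixes above (roughly 120 m apart in 4 s), this gives about 30 m/s, consistent with the reported 65 mph, and the altitude column yields the grade the vehicle was braking on.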
Internet-scraped datasets like OpenDV-YouTube provide 1,700 hours of driving footage with none of this grounding. Simulation can provide synthetic sensor streams, but suffers from a sim-to-real gap in texture statistics, lighting, and material properties — worst precisely for the rare events that matter most.
## The four requirements for world-model training data
If you take the DrivingDojo results seriously — and we think you should, given the venue and experimental clarity — then the requirements for world-model training data become fairly clear:
| # | Requirement | What it means |
|---|---|---|
| 1 | Diverse action distributions | Substantial representation of turning, braking, merging, yielding, and recovery — not just constant-velocity forward driving. A hard requirement for action-conditioned generation to work at all. |
| 2 | Synchronized multi-modal sensors | Video paired with IMU (6-axis), GNSS (10Hz+), and calibrated speed. The difference between learning pixel correlations and learning dynamics. |
| 3 | Geographic breadth | Structurally different driving environments — varied road geometries, traffic densities, regulatory regimes. Variations within a single metro area are insufficient. |
| 4 | Continuous long-tail capture | Rare events emerging organically from large fleet-hours, not scripted scenarios. An event detection and indexing system to make them tractable. |
DrivingDojo satisfied (1), and partially satisfied (4), from a single fleet in a single country — and that alone was sufficient to transform model quality.
Nobody has yet tested what happens when you satisfy all four simultaneously. We don't know exactly how the gains scale. But the direction of the DrivingDojo result is clear, and we think the diversity-quality relationship has a long way to go before it saturates.
## What this looks like in practice
Bee Maps has mapped 37% of the global road network with a dashcam fleet spanning dozens of countries — each camera continuously recording and classifying these moments. The AI Event Videos API indexes the events by type and makes them queryable with full sensor context.
Three examples from three countries, each mapping to a different requirement: missed-turn reversals on European residential streets, animal evasions on rural Portuguese roads, highway near-collisions near Porto — and the API surfaces thousands more. Every one is captured with synchronized video, GNSS, IMU, and calibrated speed: same data schema, same API endpoint.
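For scripted access, the curl calls above translate directly. A small helper (the function name is ours; the endpoint and query parameters are the ones shown in this post):

```python
def event_url(event_id, include_gnss=True, include_imu=True):
    """Build the per-event AI Event Videos URL used in the curl examples."""
    base = f"https://beemaps.com/api/developer/aievents/{event_id}"
    params = []
    if include_gnss:
        params.append("includeGnssData=true")
    if include_imu:
        params.append("includeImuData=true")
    return f"{base}?{'&'.join(params)}" if params else base

print(event_url("699201bce834f725d2c17e35"))
# https://beemaps.com/api/developer/aievents/699201bce834f725d2c17e35?includeGnssData=true&includeImuData=true
```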
The data infrastructure DrivingDojo built for one country is, in many ways, already operating globally.
## The scaling question
What makes DrivingDojo one of the more important results in the world model literature is that it isolates the variable that matters. The architectures aren't novel. The training procedures are standard. What changed was the training distribution — and that alone was enough to transform model behavior from broken to functional.
This raises the question we find most compelling: if diversity in one country's fleet data produces this kind of improvement, what happens with fleet data from dozens of countries? If 7,500 hours yields these gains, what about 75,000?
We don't have a definitive answer yet. But the direction seems clear, and we believe the teams that figure out the data supply chain for world models — not just the architectures, not just the compute, but the continuous pipeline of geographically diverse, sensor-grounded, long-tail-enriched video — will be the ones that build models that actually generalize to the physical world.

