Every human driver learns to navigate the physical world through experience — thousands of hours of turns, stops, near-misses, and split-second reactions in conditions no driving instructor could script. We learn the physics intuitively: wet roads mean longer stops, steep hills change braking feel, a bus drifting into your lane demands an immediate correction.
The AI systems being built to understand driving — world models — need to learn these same things. But tell a state-of-the-art world model to turn left, and it drives straight. Instruct it to brake, and it maintains speed. The model isn't broken — it's faithfully reproducing the dominant pattern in its training data: constant-velocity forward driving.
## 900,000 videos proved it's a data problem, not a model problem
A result from DrivingDojo, presented at NeurIPS 2024, explains why. Most widely used academic driving datasets — nuScenes, ONCE, OpenDV-YouTube — were collected for perception benchmarks, not for learning action-conditioned dynamics. Their behavioral distribution is heavily skewed toward routine forward driving, while turns, evasive maneuvers, and complex multi-agent interactions appear only sparsely.
The DrivingDojo team, working from Meituan's autonomous delivery fleet, collected 900,000 driving video clips — approximately 7,500 hours — across multiple Chinese cities with diverse maneuver distributions. Models trained on this data produced action-faithful generations. Vehicles turned when told to turn. The difference was entirely attributable to the training distribution.
Here is the kind of event that perception datasets systematically miss — a U-turn at nearly 50 mph on a rural Florida highway. No scripted collection route would include this maneuver. It exists in the data because a real driver on a real road decided to reverse course:

**Metadata**
| # | Latitude | Longitude | Altitude | Time |
|---|---|---|---|---|
| 1 | 26.450419 | -81.663473 | -14.2m | 5:25:33 PM |
| 2 | 26.450414 | -81.664212 | -14.3m | 5:25:37 PM |
| 3 | 26.450417 | -81.664604 | -14.2m | 5:25:41 PM |
| 4 | 26.450338 | -81.664548 | -14.3m | 5:25:45 PM |
| 5 | 26.450381 | -81.664154 | -14.4m | 5:25:49 PM |
| 6 | 26.450385 | -81.663508 | -14.4m | 5:25:52 PM |
| 7 | 26.450389 | -81.662706 | -14.4m | 5:25:56 PM |
| 8 | 26.450393 | -81.661788 | -14.3m | 5:26:00 PM |
**Try this API query:**

```shell
curl "https://beemaps.com/api/developer/aievents/699201bce834f725d2c17e35?includeGnssData=true&includeImuData=true" \
  -H "Authorization: Basic <your-api-key>"
```
## The perception-prediction mismatch
It helps to understand why the existing datasets don't work. The datasets the field has relied on for the past decade — nuScenes, Waymo Open, KITTI, Argoverse — were constructed for perception: 3D bounding boxes, semantic segmentation masks, point cloud labels. Their evaluation metrics — mAP, IoU, MOTA — all answer a single question: what objects are present in this scene, and where are they?
World models answer a categorically different question:
> Given the current scene state and an ego-action, what is the likely next state?
This is a conditional distribution over future video frames, not a classification over spatial labels. Training it requires temporal sequences with sufficient diversity in the conditioning variable — the ego-vehicle's action space.
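In symbols (our notation, not any particular paper's), the object being learned is the conditional

```latex
p_\theta\left(x_{t+1:t+H} \mid x_{1:t},\, a_t\right)
```

where $x$ are video frames, $a_t$ is the ego-action at time $t$, and $H$ is the prediction horizon. Action-conditioned generation only works if $a_t$ carries real information about the future frames in the training set.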
Why this happens: the model is trying to learn a relationship between actions and outcomes. But if 95% of the training data is "drive straight at constant speed," the model learns to ignore the action entirely. It doesn't matter what you tell it to do — the overwhelming pattern in the data is "keep going forward," so that's what it predicts.
The action conditioning becomes a no-op. DrivingDojo's empirical results are consistent with this interpretation — and it is what you would expect from the structure of the training data.
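The no-op failure mode is easy to see with toy numbers (illustrative, not DrivingDojo's actual statistics). When 95% of samples are forward driving, a model that ignores the action entirely is already 95% accurate:

```python
from collections import Counter

# Hypothetical skewed dataset: each sample pairs a commanded action
# with the motion actually observed in the clip. 95% forward driving.
dataset = (
    [("straight", "straight")] * 950
    + [("turn_left", "turn_left")] * 15
    + [("turn_right", "turn_right")] * 15
    + [("brake", "brake")] * 20
)

# A degenerate "model" that ignores the action and always predicts
# the most common motion in the training data.
mode_motion = Counter(motion for _, motion in dataset).most_common(1)[0][0]
baseline_acc = sum(motion == mode_motion for _, motion in dataset) / len(dataset)

# A model that actually reads the action conditioning.
conditioned_acc = sum(motion == action for action, motion in dataset) / len(dataset)

print(mode_motion, baseline_acc, conditioned_acc)  # straight 0.95 1.0
```

The gradient signal pushing the model to use the conditioning comes only from that last 5% of samples, which is why rebalancing the action distribution matters more than raw volume.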
Tu et al., in their February 2025 survey, arrive at a compatible conclusion: world models require "multi-sensor, multimodal operational architectures" that perception benchmarks were never designed to provide.
## Three gaps that more data alone won't close
The temptation is to say "we just need more data." But we think the problem is more specific than that. There are three distinct gaps that synthetic data, simulation, and internet-scraped video all fail to close.
### 1. Long-tail events
DrivingDojo-Open includes 3,700 video clips curated for rare scenarios:
- Adverse weather
- Foreign objects on the roadway
- Near-collisions
- Unusual traffic configurations
Including these clips measurably improved generation quality on out-of-distribution test scenarios.
The fundamental issue: long-tail events are definitionally rare, so any fixed-size collection effort will underrepresent them. The only way to capture them at scale is a large fleet continuously recording over long time horizons, paired with an event-detection system that identifies and indexes the high-information moments.
You cannot script a near-miss. You cannot plan for debris on the highway at kilometer 47 during a rainstorm. Coverage of these events is a function of fleet-hours and geographic breadth.
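The event-detection half of that system can start surprisingly small. Here is an illustrative sketch of a fleet-side detector for one event class, hard braking, flagged from the longitudinal-acceleration stream (the function name and thresholds are ours, not a production system):

```python
def detect_hard_braking(accel_long, hz=100, threshold=-3.0, min_duration_s=0.3):
    """Return (start, end) sample-index pairs where longitudinal acceleration
    stays at or below `threshold` (m/s^2) for at least `min_duration_s`."""
    min_samples = int(min_duration_s * hz)
    events, start = [], None
    for i, a in enumerate(accel_long):
        if a <= threshold:
            if start is None:
                start = i  # braking window opens
        else:
            if start is not None and i - start >= min_samples:
                events.append((start, i))  # sustained long enough to index
            start = None
    if start is not None and len(accel_long) - start >= min_samples:
        events.append((start, len(accel_long)))
    return events

# Synthetic 100 Hz trace: 1 s cruising, 0.5 s of -4 m/s^2 braking, 1 s cruising.
trace = [0.0] * 100 + [-4.0] * 50 + [0.0] * 100
print(detect_hard_braking(trace))  # [(100, 150)]
```

The hard part is not this logic; it is running something like it continuously across millions of fleet-hours and indexing the hits so they are retrievable.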
### 2. Geographic diversity
Liu et al. surveyed 265 autonomous driving datasets and documented persistent geographic concentration:
- KITTI — limited to German urban areas
- Waymo Open — Phoenix, San Francisco, and a few other American cities
- nuScenes — Boston and Singapore
- DrivingDojo — Chinese cities, delivery-vehicle perspective only
This matters more than it might seem at first. Driving behavior is a function of local norms, infrastructure design, and cultural convention. Lane discipline in Mumbai is structurally different from Munich. Roundabout negotiation varies between left-hand and right-hand driving countries.
A world model trained on one geographic distribution will likely fail on others — not because of a modeling limitation, but because the underlying behavior distribution is genuinely different.
### 3. Sensor grounding
This is perhaps the most underappreciated gap.
A 10-second video clip of a vehicle swerving is ambiguous. Was it an evasive maneuver or a lane change? What was the lateral acceleration? What was the velocity profile?
Without synchronized sensors, a model must infer physics from pixel changes. This is possible in principle, but extraordinarily sample-inefficient compared to providing ground truth directly:
- IMU stream → explicit supervision on forces
- 30Hz GNSS trace → explicit supervision on trajectories
- Calibrated speed → explicit supervision on velocity
These aren't auxiliary signals. For a world model learning physical dynamics, they may be the most important training signal available.
This event illustrates the point. A vehicle on the 101 freeway in Los Angeles braked hard at 65 mph — 1.4 m/s² deceleration — when a car cut across lanes ahead. Watch the video. It looks like typical LA traffic. Only the sensor data reveals this was an emergency stop:

**Metadata**
| # | Latitude | Longitude | Altitude | Time |
|---|---|---|---|---|
| 1 | 34.119326 | -118.337920 | 151.9m | 11:09:51 PM |
| 2 | 34.120291 | -118.338538 | 157.9m | 11:09:55 PM |
| 3 | 34.121219 | -118.339284 | 164.0m | 11:09:59 PM |
| 4 | 34.122091 | -118.340158 | 170.7m | 11:10:03 PM |
| 5 | 34.122886 | -118.341017 | 176.9m | 11:10:07 PM |
| 6 | 34.123061 | -118.341194 | 178.3m | 11:10:11 PM |
| 7 | 34.123331 | -118.341511 | 180.4m | 11:10:16 PM |
| 8 | 34.123750 | -118.342009 | 183.9m | 11:10:20 PM |
**Try this API query:**

```shell
curl "https://beemaps.com/api/developer/aievents/6976a302e13d2ed988573033?includeGnssData=true&includeImuData=true" \
  -H "Authorization: Basic <your-api-key>"
A video-only dataset would label this as routine freeway driving. But the deceleration profile — and the GNSS altitude climbing 30 meters over the clip as the vehicle ascends the Cahuenga Pass — tells a different story. For a world model learning braking physics on grades, this is exactly the kind of sample that matters.
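Both readings can be sanity-checked directly from GNSS fixes like the ones in the table above. A minimal sketch (spherical-earth distance, our helper names, not the Bee Maps SDK):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon fixes."""
    r = 6371000.0  # mean earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def segment_speeds_mps(fixes):
    """fixes: list of (lat, lon, alt_m, t_s). Per-segment ground speed, m/s."""
    return [
        haversine_m(a[0], a[1], b[0], b[1]) / (b[3] - a[3])
        for a, b in zip(fixes, fixes[1:])
    ]

def segment_grades(fixes):
    """Rise over run for each consecutive pair of fixes."""
    return [
        (b[2] - a[2]) / max(haversine_m(a[0], a[1], b[0], b[1]), 1e-9)
        for a, b in zip(fixes, fixes[1:])
    ]
```

Applied to the first two LA fixes above (roughly 120 m apart in 4 s), this gives about 30 m/s, consistent with the reported 65 mph, and the altitude column yields the grade the vehicle was braking on.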
Internet-scraped datasets like OpenDV-YouTube provide 1,700 hours of driving footage with none of this grounding. Simulation can provide synthetic sensor streams, but suffers from a sim-to-real gap in texture statistics, lighting, and material properties — worst precisely for the rare events that matter most.
## The four requirements for world-model training data
If you take the DrivingDojo results seriously — and we think you should, given the venue and experimental clarity — then the requirements for world-model training data become fairly clear:
| # | Requirement | What it means |
|---|---|---|
| 1 | Diverse action distributions | Substantial representation of turning, braking, merging, yielding, and recovery — not just constant-velocity forward driving. A hard requirement for action-conditioned generation to work at all. |
| 2 | Synchronized multi-modal sensors | Video paired with IMU (6-axis), GNSS (10Hz+), and calibrated speed. The difference between learning pixel correlations and learning dynamics. |
| 3 | Geographic breadth | Structurally different driving environments — varied road geometries, traffic densities, regulatory regimes. Variations within a single metro area are insufficient. |
| 4 | Continuous long-tail capture | Rare events emerging organically from large fleet-hours, not scripted scenarios. An event detection and indexing system to make them tractable. |
DrivingDojo satisfied (1), and partially satisfied (4), from a single fleet in a single country — and that alone was sufficient to transform model quality.
Nobody has yet tested what happens when you satisfy all four simultaneously. We don't know exactly how the gains scale. But the direction of the DrivingDojo result is clear, and we think the diversity-quality relationship has a long way to go before it saturates.
## What this looks like in practice
Bee Maps has mapped 37% of the global road network with a dashcam fleet spanning dozens of countries — each camera continuously recording and classifying these moments. The AI Event Videos API indexes the events by type and makes them queryable with full sensor context.
Three examples from three countries, each mapping to a different requirement: missed-turn reversals on European residential streets, animal evasions on rural Portuguese roads, highway near-collisions near Porto — and the API surfaces thousands more. Every one is captured with synchronized video, GNSS, IMU, and calibrated speed: same data schema, same API endpoint.
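For scripted access, the curl calls above translate directly. A small helper (the function name is ours; the endpoint and query parameters are the ones shown in this post):

```python
def event_url(event_id, include_gnss=True, include_imu=True):
    """Build the per-event AI Event Videos URL used in the curl examples."""
    base = f"https://beemaps.com/api/developer/aievents/{event_id}"
    params = []
    if include_gnss:
        params.append("includeGnssData=true")
    if include_imu:
        params.append("includeImuData=true")
    return f"{base}?{'&'.join(params)}" if params else base

print(event_url("699201bce834f725d2c17e35"))
# https://beemaps.com/api/developer/aievents/699201bce834f725d2c17e35?includeGnssData=true&includeImuData=true
```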
The data infrastructure DrivingDojo built for one country is, in many ways, already operating globally.
## The scaling question
What makes DrivingDojo one of the more important results in the world model literature is that it isolates the variable that matters. The architectures aren't novel. The training procedures are standard. What changed was the training distribution — and that alone was enough to transform model behavior from broken to functional.
This raises the question we find most compelling: if diversity in one country's fleet data produces this kind of improvement, what happens with fleet data from dozens of countries? If 7,500 hours yields these gains, what about 75,000?
We don't have a definitive answer yet. But the direction seems clear, and we believe the teams that figure out the data supply chain for world models — not just the architectures, not just the compute, but the continuous pipeline of geographically diverse, sensor-grounded, long-tail-enriched video — will be the ones that build models that actually generalize to the physical world.

