
You Can't Manufacture a Near-Miss

March 6, 2026


Training a world foundation model is really two distinct problems that people tend to collapse into one. The first — pre-training — is about teaching a model how the physical world generally works. The second — post-training — is about making that understanding specific enough to be useful. These require fundamentally different data. Neither is solved for driving.

Pre-training is not solved. It just looks that way.

The conventional take is that pre-training is a volume problem — throw enough video at the model and it learns physics. This is roughly true for general world models. But for domain-specific pre-training of driving world models, bulk video is actively misleading.

Most available driving video looks the same. Sunny highways. Uneventful lane-keeping. Straight roads in good conditions. A model pre-trained on this develops strong priors for the 99% case and almost no understanding of the 1% that actually matters — harsh braking, near-misses with vulnerable road users, unusual road geometries, weather-degraded visibility, emergency vehicle encounters.

To put real numbers on this: across the Bee network, harsh braking events occur roughly once every 8 hours of driving. Near-misses with pedestrians or cyclists show up once every 40-50 hours. Swerving or evasive maneuvers are even rarer. In a bulk dashcam dataset of 100,000 hours, you might find 12,000 harsh braking clips and 2,000 VRU near-misses — buried under 85,000+ hours where nothing happens. That's the composition problem. The events that matter most are the ones least represented.
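The composition arithmetic above is easy to sanity-check. A minimal sketch, using the per-event rates quoted in this post (the midpoint of 40-50 hours is an assumption):

```python
# Expected rare-event counts in a bulk dashcam corpus,
# given roughly how many driving hours separate each event.
RATES_HOURS_PER_EVENT = {
    "harsh_braking": 8,    # ~1 event per 8 driving hours
    "vru_near_miss": 45,   # ~1 per 40-50 hours (midpoint)
}

def expected_events(corpus_hours: float) -> dict[str, int]:
    """Rough expected count of each rare event type in a corpus."""
    return {
        event: round(corpus_hours / hours_per_event)
        for event, hours_per_event in RATES_HOURS_PER_EVENT.items()
    }

counts = expected_events(100_000)
# ~12,500 harsh-braking clips and ~2,200 VRU near-misses:
# a sliver of the corpus, consistent with the figures above.
```

The point of the exercise is that even generous rates leave the critical events at well under 1% of total hours.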

Here's what that 1% actually looks like:

A driver in San Francisco nearly hits a pedestrian, braking from 22 mph to a dead stop
A cyclist forces an evasive maneuver in central London
A near miss at a roundabout in the UK

These aren't staged. They're real events captured by Bee dashcams running edge AI in the field. A bulk dashcam dataset dominated by highway footage will contain almost none of these moments (pedestrians and cyclists barely appear on highways at all), and the few it does contain are buried among hundreds of thousands of uneventful hours. A world model trained on that corpus will learn to predict straight roads in good weather with high confidence, and will have almost no basis for simulating the scenarios that actually kill people.

Some researchers argue that sufficiently general representations should transfer — that a model trained on enough normal driving will generalize to edge cases it hasn't seen. For non-critical applications, maybe. But in driving, where a single hallucinated frame can represent a fatal planning error, statistical coverage of rare event types during pre-training isn't optional. A world model that's never seen a cyclist swerve into traffic won't reliably simulate it during post-training. The foundation has to be there.

Domain pre-training — pre-training a world model specifically on high-quality, diverse driving data rather than generic internet video — is where the real opportunity lies. This isn't a volume play. It's a quality and diversity play.

Post-training is where it gets specific

Post-training is where a foundation model goes from understanding driving generally to simulating specific scenarios with physical plausibility. A cyclist cutting across traffic in Ho Chi Minh City. A sudden lane closure on wet pavement in Munich. A child running between parked cars in Lagos.

The same event type plays out completely differently depending on geography:

A near miss and emergency stop at a red light in Mexico City
A driver in Romania doing 113 mph on a regional highway

A model fine-tuned on US highway data won't understand Romanian road behavior, and vice versa. The physics are the same. The driving culture, road geometry, and risk profiles are not.

The data requirements shift. Post-training data needs geo-location per frame, semantic annotations with a consistent ontology, temporal event labels marking exactly when something interesting happens, and freshness: a model fine-tuned on video of a road rebuilt eight months ago is learning a road layout that no longer exists.
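As a concrete sketch, per-clip metadata meeting those requirements might look like the following. The field names and ontology labels here are illustrative assumptions, not Bee's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class FramePose:
    timestamp_s: float   # seconds from clip start
    lat: float           # GNSS latitude, recorded per frame
    lon: float           # GNSS longitude, recorded per frame

@dataclass
class EventLabel:
    event_type: str      # from a fixed ontology, e.g. "harsh_braking"
    start_s: float       # when the event begins within the clip
    end_s: float         # when it ends
    severity: float      # 0.0 (mild) to 1.0 (extreme)

@dataclass
class ClipMetadata:
    clip_id: str
    captured_at: str     # ISO-8601 timestamp; freshness is checkable
    poses: list[FramePose] = field(default_factory=list)
    events: list[EventLabel] = field(default_factory=list)
```

The important property is that geography, event timing, and capture date are all first-class queryable fields rather than something recovered by reprocessing the video.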

Post-training also happens continuously. Pre-training might happen once or twice a year. Post-training happens every time a team adapts a model to a new geography, a new edge case, a new customer. It's the recurring cost center and the recurring data bottleneck.

How existing datasets fall short

The public driving datasets available today were built for perception, not world model training. Their limitations are structural:

| Dataset | Total Hours | Countries | Edge-Case Events | Geo-Tagged | Refreshed |
| --- | --- | --- | --- | --- | --- |
| nuScenes | ~5.5 | 2 | ~200 annotated scenes | Yes | No (2019) |
| Waymo Open | ~10 | 1 | Sparse | Yes | No (2019-2023) |
| BDD100K | ~1,100 | 1 | Minimal labeling | Partial | No (2018) |
| OpenDV-YouTube | ~1,700 | Mixed | Unstructured | No | No (2023) |
| DrivingDojo | ~7,500 | 1 | Moderate | Yes | No (2024) |
| Bee Network | 8,000+ | 30+ | Continuously detected | Per-frame GNSS | Daily |

nuScenes and Waymo Open are small by modern standards — a few hours of carefully curated driving. BDD100K has scale but minimal event labeling and US-only coverage. OpenDV-YouTube has geographic diversity scraped from the internet but no structured metadata, no geo-tags, and no event annotations. DrivingDojo showed that better data composition fixes the controllability problem, but it's a static Chinese-city dataset that isn't available to other teams.

None of these were designed to refresh. The physical world changes — roads get rebuilt, intersections reconfigured, new construction zones appear — and a static dataset can't track that.

The industry is starting to figure this out

There's an interesting signal from the broader AI industry that's worth examining carefully. OpenAI offered $500 million last year to acquire Medal, a platform where gamers upload clips of their gameplay. The deal didn't close, and Medal instead spun out an AI lab called General Intuition, which raised $134 million in seed funding — Khosla Ventures' largest seed check since OpenAI.

The underlying logic is worth unpacking. Gamers don't share random footage. They share edge cases — spectacular wins, catastrophic failures, the moments that break from routine. Medal's CEO noted that this selection bias is "precisely the kind of data you actually want to use for training." The dataset has a natural enrichment toward spatially and temporally complex moments, which is the opposite of what you get from passive bulk collection.

This seems directionally right to us, and it mirrors what we've observed in driving data. But it also highlights an important gap. Gaming environments, however complex, are still rendered worlds with known physics engines. The transfer from Fortnite to a wet intersection in São Paulo is not straightforward. General spatial reasoning is necessary but not sufficient for the driving domain — you also need the messiness of real road surfaces, real human decision-making, real weather degradation, and the long tail of physical situations that no game engine simulates.

Tesla has probably the closest thing to an ideal driving dataset — billions of miles from millions of vehicles with full sensor suites. But that data serves Tesla's autonomy stack exclusively. It's a closed competitive advantage, not infrastructure for the field. For everyone else working on driving world models, the situation is structurally similar to where language models were before large-scale open text corpora existed: the people who need the data most have the least access to it, and what's publicly available has the wrong distribution.

What the ideal training data source looks like

The requirements are clear. For domain pre-training, you need video that's weighted toward rare events — not a corpus where 99.5% of frames are uneventful highway driving. For post-training, you need geo-tagged, event-labeled clips that can be queried by geography, event type, and time period without reprocessing raw video. And both require freshness — the physical world changes, and stale data teaches stale physics.
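One way to make "weighted toward rare events" concrete is inverse-frequency sampling: clips from underrepresented event categories are drawn far more often than their share of the raw hours. A minimal sketch, where the category names and hour counts are the hypothetical corpus described earlier:

```python
def sampling_weights(category_hours: dict[str, float]) -> dict[str, float]:
    """Per-category sampling weights inversely proportional to each
    category's share of the raw corpus, normalized to sum to 1."""
    inv = {cat: 1.0 / hours for cat, hours in category_hours.items()}
    total = sum(inv.values())
    return {cat: w / total for cat, w in inv.items()}

# Raw composition of a hypothetical 100,000-hour bulk corpus:
weights = sampling_weights({
    "uneventful": 85_000,
    "harsh_braking": 12_000,
    "vru_near_miss": 2_000,
})
# Rare categories now dominate the sampling distribution even
# though they are a tiny fraction of the raw hours.
```

Inverse frequency is the bluntest possible scheme; in practice a team would cap the upweighting to avoid starving the model of normal driving, but the principle is the same.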

The only way to get this at scale is a collection mechanism with built-in selection bias toward the interesting, running continuously, across diverse geographies. Not a one-time data collection campaign. A persistent network.

What AI Event Videos solve

We built AI Event Videos to be this data layer — from domain pre-training through post-training and evaluation.

Every Bee dashcam in the network runs edge AI that detects and captures specific events in real time — harsh braking, VRU proximity, sudden stops, unusual driving patterns. When something interesting happens, it captures the full video context around the event. Not a frame. The continuous sequence.

The mechanism is structurally identical to what makes Medal's data valuable: selection bias toward the interesting. But instead of gamers choosing which clips to upload, edge AI on 75,000+ dashcams across 30+ countries is detecting events as they happen on real roads.

Multi-agent interaction at highway speed — a driver weaving between trucks at 85 mph on I-580 in Oakland. A world model needs to understand relative velocity, lane prediction, and the physics of rapid lane changes.
120 mph on I-95 in Delaware — extreme speed that stress-tests whether a model has learned realistic vehicle dynamics or just straight-line extrapolation
Aggressive evasive maneuvers on a Florida highway — sudden directional changes at speed

Every one of these was detected and captured automatically by edge AI — no human review, no manual labeling, no data collection campaign. The network just runs.

Each clip ships with:

GNSS coordinates per frame. Want 5,000 harsh braking events from Southeast Asian intersections? European roundabouts? Unpaved roads in Sub-Saharan Africa? It's an API call.

Temporal event annotations. Tagged at the edge in real time. What happened, when, how severe.

IMU and motion data. Ego-motion, acceleration, orientation — useful for quality filtering and distinguishing real events from sensor noise.

Freshness in days, not years. The network captures continuously. The data refreshes itself because the devices are already on the road.
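IMU traces make the quality filtering mentioned above cheap: a flagged harsh-braking clip can be confirmed to contain sustained real deceleration rather than a single noisy sample. A rough sketch, where the 0.4 g threshold and minimum duration are illustrative assumptions, not the network's actual detector:

```python
G = 9.81  # m/s^2 per g

def is_real_harsh_brake(longitudinal_accel: list[float],
                        sample_hz: float = 50.0,
                        threshold_g: float = 0.4,
                        min_duration_s: float = 0.3) -> bool:
    """True if deceleration exceeds threshold_g for at least
    min_duration_s of consecutive samples; an isolated spike
    is treated as sensor noise."""
    min_samples = int(min_duration_s * sample_hz)
    run = 0
    for a in longitudinal_accel:  # negative = deceleration, m/s^2
        run = run + 1 if a <= -threshold_g * G else 0
        if run >= min_samples:
            return True
    return False
```

A sustained 0.5 g braking trace passes; a one-sample glitch of the same magnitude does not, which is exactly the distinction bulk video alone can't make.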

For domain pre-training, this means you can compose a training corpus that's weighted toward the events that actually matter. You're not just adding more data. You're adding the data that changes what the model learns.

For post-training, the geo-tags and event labels mean you can construct fine-tuning datasets for specific domains without re-processing raw video. Need to fine-tune for Indian urban driving? Query by geography + event type + time of day.
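A query like that reduces to filter parameters on an HTTP API. A sketch of composing one, where the endpoint path and parameter names are assumptions for illustration, not the documented AI Event Videos API:

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the real API base URL.
BASE = "https://api.example.com/v1/event-videos"

def build_query(country: str, event_type: str,
                time_of_day: str, limit: int = 1000) -> str:
    """Compose a hypothetical fine-tuning-dataset query URL."""
    params = {
        "country": country,
        "event_type": event_type,
        "time_of_day": time_of_day,
        "limit": limit,
    }
    return f"{BASE}?{urlencode(params)}"

url = build_query("IN", "vru_near_miss", "night")
```

The design point is that dataset construction becomes a filter expression over metadata rather than a reprocessing job over raw video.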

Freshness is a moat

The physical world changes. Roads get rebuilt. Construction zones appear and disappear. Lane configurations shift. A world model trained on a static dataset is slowly going stale, and you won't know it until it generates something physically wrong.

Traditional data collection is a campaign. You hire a fleet, drive routes, collect, label, and by the time you've processed it, the world has moved on. The labeled intersection from six months ago has a new traffic light.

The network captures continuously as a byproduct of vehicles already driving. Near-zero marginal collection cost. New events flow in daily. Both pre-training refreshes and post-training runs can draw from data that's days old, not years.

Where this goes

The world model race won't be decided by compute or architecture alone. The teams that crack the data composition problem — continuous access to rare events, at global scale, structured for training, refreshed daily — will be the ones that ship models worth trusting on real roads.

We're expanding event type coverage, adding new sensor modalities, and scaling the network into geographies that are completely absent from existing datasets — Sub-Saharan Africa, Southeast Asia, the Middle East. The most dangerous roads in the world are also the most underrepresented in training data. That's the gap we're closing.

If you're building a driving world model and the data composition problem is real for your team, the AI Event Videos API is live today. Try it in the API playground or reach out directly — we'd rather talk to you about what you need than guess.