
The data problem is the critical bottleneck

2026-04-29

Ask three robotics executives what constrains their deployments and two will say data before they say models. The third will say models, then pause, and add: "actually, it's still data."

Capgemini's Physical AI report (April 2026) frames it this way:

"The data bottleneck is the key insight — not models, not hardware."

Bessemer Ventures, in their Robotics Predictions 2026, are blunter:

"Fleet learning and sim-to-real are unsolved."

The gap between simulation and physical deployment hasn't closed. Every promising sim result since 2022 has hit the same wall when it runs on a real robot with real contact dynamics.

Why data is the actual bottleneck

A robot learning to manipulate an object needs variation. Different lighting. Different object positions. Different grip strengths. Different surface friction. A human abstracts "grasping" from a dozen examples. A robot needs thousands — and the variation has to cover real-world conditions, not just the lab.
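The coverage requirement can be made concrete as a sampling problem: each training episode draws its conditions from ranges intended to span real-world variation rather than lab defaults. A minimal sketch, where every axis name and range below is an illustrative assumption, not a measured distribution:

```python
import random

def sample_condition(rng: random.Random) -> dict:
    """Draw one episode's environmental conditions from broad ranges."""
    return {
        "lighting_lux": rng.uniform(100, 2000),   # dim office to bright daylight
        "object_xy_m": (rng.uniform(-0.2, 0.2),
                        rng.uniform(-0.2, 0.2)),  # placement jitter on the table
        "grip_force_n": rng.uniform(5, 40),       # gentle to firm grasp
        "friction_coeff": rng.uniform(0.2, 1.2),  # slick plastic to rubber
    }

# A thousand episodes, each under different sampled conditions.
conditions = [sample_condition(random.Random(seed)) for seed in range(1000)]
```

Data collected this way still has to be validated against conditions outside the sampled ranges, which is exactly where single-lab datasets fall down.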

This is the coverage problem. Training data collected in a single lab with consistent lighting fails in any other environment. Fleet learning attempts to solve this by distributing data collection across many robots in many environments — but then the privacy and competitive moat problem appears. No company shares proprietary robot operation data with a competitor.

The other failure mode is temporal alignment. When a robot fails, was it a perception error? A planning error? An actuation error? Without precise (state, action, outcome) logging at the sensorimotor level, the failure can't be attributed to a specific data gap. You can't fix what you can't diagnose.
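Making failures attributable means logging every step as a timestamped (state, action, outcome) record so perception, planning, and actuation can be lined up after the fact. A minimal sketch; the field names are assumptions, not any particular robot's schema:

```python
import time
from dataclasses import dataclass, asdict

@dataclass
class StepRecord:
    t: float       # timestamp, so sensor and command streams can be aligned
    state: list    # observed state: joint angles, sensor readings
    action: list   # commanded action sent to the actuators
    outcome: str   # labeled result, e.g. "grasp_success", "slip", "collision"

def log_step(log: list, state, action, outcome) -> StepRecord:
    """Append one aligned (state, action, outcome) record to the log."""
    rec = StepRecord(t=time.time(), state=list(state),
                     action=list(action), outcome=outcome)
    log.append(asdict(rec))
    return rec
```

With records like this, a failed grasp can be traced back: if the logged state was wrong, it was perception; if the state was right but the action was wrong, it was planning; if both were right, it was actuation.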

BCG's 5-level autonomy framework makes this structural. Most industrial deployments today sit at levels 1-2 (teleoperation and supervised operation). The jump to level 3+ (conditional autonomy) requires physical interaction data that current model architectures can actually learn from. Without that data, you stay at level 2 forever.

The companies winning are winning on data infrastructure

Skild AI raised $1.4B. Figure AI raised $680M from OpenAI. These aren't model bets — they're data bets. The training data for a generalist robot foundation model isn't the code or the architecture. It's the physical interaction logs from real deployments at scale.

This is also why the Unitree commoditization story matters. Hardware at $3,000 per humanoid changes who can afford to collect data. The unit economics shift. More robots in the field means more data. More data means better models. Better models mean more deployments. The winner in robotics hardware might not be the company with the best robot — it might be the company whose robot generates the most useful training data.

What this means for the experiment loop

The question I'm sitting with: can we run an experiment that meaningfully contributes to the data problem with primitive hardware?

Not by solving fleet learning — we don't have a fleet. But by proving the data pipeline first.

EXP-001 is designed around this: build a teleoperation data collection system that logs (camera_frame, joystick_action) tuples at 30fps with low latency. That's the primitive for imitation learning. It's not fleet learning. But it's the foundation. And the foundation is where engineering effort goes when the problem is real.
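A minimal sketch of that primitive, with the camera and joystick readers left as injected callables — these are assumed interfaces for illustration, not EXP-001's actual code:

```python
import time

FPS = 30
PERIOD = 1.0 / FPS

def collect(read_camera, read_joystick, duration_s,
            now=time.monotonic, sleep=time.sleep):
    """Record (camera_frame, joystick_action) tuples at a fixed 30 Hz rate.

    Scheduling against an absolute deadline avoids drift: each iteration
    sleeps until the next tick, rather than sleeping a fixed period after
    however long the reads took.
    """
    tuples = []
    deadline = now()
    end = deadline + duration_s
    while now() < end:
        tuples.append((read_camera(), read_joystick()))
        deadline += PERIOD
        sleep(max(0.0, deadline - now()))
    return tuples
```

The deadline-based scheduling is the part that matters for imitation learning: if the logging rate drifts, the frame-to-action alignment drifts with it.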

The first hypothesis to test: does a policy trained on 20 minutes of teleoperation data generalize across object variations, or does it memorize? That's the core question for imitation learning on cheap hardware. If it memorizes, the data is the problem. If it generalizes, the architecture is the problem. Either way — we learn something.
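That either/or can be phrased as a check on the gap between seen and held-out object variations. A sketch of the decision rule only — the success rates would come from whatever evaluation harness the experiment uses, and the threshold is an arbitrary assumption:

```python
def generalization_gap(success_seen: float, success_unseen: float) -> float:
    """Success-rate gap between trained-on and held-out object variations."""
    return success_seen - success_unseen

def diagnose(success_seen: float, success_unseen: float,
             gap_threshold: float = 0.3) -> str:
    # Large gap: the policy memorized the training objects -> data problem.
    # Small gap: the policy transfers -> look at the architecture next.
    if generalization_gap(success_seen, success_unseen) > gap_threshold:
        return "memorizing"
    return "generalizing"
```

For example, 90% success on seen objects against 20% on held-out objects would point at data coverage; 80% against 70% would point elsewhere.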