Moritz Reuss, in his ICLR 2026 VLA survey, pulled keyword counts from OpenReview:
That kind of growth curve doesn't happen by accident. Something clicked. The transformer architecture worked for language, then for images, then for video — and researchers figured it could work for robot actions too. Not as a simulation of thought, but as a direct mapping from perception to control.
What counts as a VLA sounds like a taxonomy question, but it isn't. The answer determines what you can expect from a model.
Reuss draws a clean line: a Vision-Language-Action model requires internet-scale vision-language pretraining before being fine-tuned on robot actions. Without that pretraining, it's a multimodal policy — different architecture, different generalization guarantees, different failure modes.
The distinction matters because internet-scale VLM pretraining gives VLAs something multimodal policies don't have: the ability to follow complex language instructions in novel contexts, generalize across tasks they've never seen, and reason about spatial relationships from textual descriptions alone.
This is why Physical Intelligence's π0, Figure AI's models, and now Ant Group's LingBot-VLA all follow the same blueprint: start with a pretrained VLM backbone (Qwen, LLaVA, etc.), freeze or partially freeze it, then add an action head trained with robotic demonstration data.
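To make the blueprint concrete, here is a minimal sketch of that recipe in PyTorch. Everything specific is a placeholder: the backbone name, the mean-pooling, and the head sizes are illustrative, not LingBot-VLA's or π0's actual design, and it assumes a HuggingFace-style VLM whose forward pass returns last_hidden_state.

```python
import torch.nn as nn
from transformers import AutoModel

class MinimalVLA(nn.Module):
    def __init__(self, backbone="Qwen/Qwen2.5-VL-3B-Instruct",
                 action_dim=7, chunk_len=16):
        super().__init__()
        # Internet-scale pretrained VLM, frozen: fine-tuning on robot
        # demonstrations only updates the action head.
        self.vlm = AutoModel.from_pretrained(backbone)
        for p in self.vlm.parameters():
            p.requires_grad = False
        # Assumes a flat config; some multimodal configs nest this
        # under text_config.
        hidden = self.vlm.config.hidden_size
        # Action head: pooled VLM features -> a chunk of continuous actions.
        self.action_head = nn.Sequential(
            nn.Linear(hidden, 1024), nn.GELU(),
            nn.Linear(1024, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, **vlm_inputs):
        # Pool the VLM's final hidden states into one conditioning vector.
        feats = self.vlm(**vlm_inputs).last_hidden_state.mean(dim=1)
        return self.action_head(feats).view(-1, self.chunk_len, self.action_dim)
```

Production systems replace the pooled-feature MLP with something heavier (LingBot-VLA uses a Mixture of Transformers action expert, covered below), but the division of labor is the same: the VLM supplies semantics, the head supplies control.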
Ant Group released LingBot-VLA in January 2026. It trained on 20,000 hours of teleoperated dual-arm data across 9 robot embodiments — AgiBot G1, Galaxea R1, Leju KUAVO, and others. The model is evaluated on the GM-100 benchmark, which uses 100 real-world manipulation tasks with 130 teleoperated trajectories per task.
The results:
What makes LingBot-VLA interesting isn't just the benchmark number — it's what drove it. Ant Group built a depth distillation module (LingBot-Depth) that learns to reconstruct dense metric depth from sparse RGB-D inputs. When depth sensors fail or return noisy measurements in regions like corners and edges, the model still has geometric reasoning. This matters for insertion, stacking, and folding tasks where 3D spatial precision is the bottleneck.
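Ant Group hasn't released LingBot-Depth at the level of code, so treat the following as a generic depth-completion sketch of the idea, not their module: a small network takes RGB plus the sensor's sparse, hole-ridden depth channel and is supervised to produce dense metric depth, with the loss applied only where ground truth exists.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthCompletion(nn.Module):
    """Generic sketch: predict dense metric depth from RGB plus a
    sparse/noisy sensor depth channel (4 input channels total)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, rgb, sparse_depth):
        # Concatenate RGB (B,3,H,W) with sensor depth (B,1,H,W).
        return self.net(torch.cat([rgb, sparse_depth], dim=1))

def completion_loss(pred, gt_depth, valid_mask):
    # Supervise only pixels where ground-truth depth exists; the net
    # learns to fill the corners and edges where the sensor fails.
    return F.l1_loss(pred[valid_mask], gt_depth[valid_mask])
```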
The architecture uses Qwen2.5-VL as its vision-language backbone with a Mixture of Transformers action expert. Actions are modeled as a flow matching problem — the model learns a vector field that transports Gaussian noise to a ground truth action trajectory along a linear probability path. This produces continuous, temporally smooth action chunks rather than discrete action tokens.
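That description maps onto the standard conditional flow matching objective. Here is a minimal training-step sketch under the paragraph's own setup (linear path, Gaussian source); velocity_net is a placeholder for the action expert, and the shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, actions, cond):
    # actions: (B, chunk_len, action_dim) ground-truth trajectory chunk
    # cond:    conditioning features from the VLM backbone
    x1 = actions
    x0 = torch.randn_like(x1)                            # Gaussian noise source
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # sample t ~ U[0, 1]
    xt = (1 - t) * x0 + t * x1                           # linear probability path
    target_v = x1 - x0                                   # constant velocity along that path
    pred_v = velocity_net(xt, t.flatten(), cond)         # predict the vector field
    return F.mse_loss(pred_v, target_v)
```

At inference you integrate the learned field, for example a few Euler steps carrying x0 ~ N(0, I) from t = 0 to t = 1, which is what yields the continuous, temporally smooth action chunks rather than a sequence of discrete tokens.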
Most discussion of LingBot-VLA focuses on the benchmark. The more important finding is in the scaling analysis:
Both success rate and progress score increase monotonically with pretraining data volume from 3,000 to 20,000 hours. No saturation. This is the first empirical evidence that VLA models continue benefiting from more real robot data at this scale.
Compare this to language models, where the returns to further scaling have been contested since GPT-4. The fact that LingBot-VLA shows monotonic improvement across this range suggests the data problem in robotics is even more acute than previously thought: more data demonstrably helps, and we are not yet at the ceiling.
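"No saturation" is a checkable claim: fit both a pure power law and a saturating curve to the (hours, success-rate) points and compare residuals. The numbers below are made up purely for illustration; the real points are in the LingBot-VLA report.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (hours, success-rate) points, for illustration only;
# the real GM-100 numbers are in the LingBot-VLA report.
hours = np.array([3000.0, 7000.0, 12000.0, 20000.0])
succ = np.array([0.42, 0.51, 0.57, 0.63])

def power_law(x, a, b):        # keeps climbing: no ceiling term
    return a * x ** b

def saturating(x, s, k):       # flattens out toward a ceiling s
    return s * (1 - np.exp(-x / k))

for fn, p0 in [(power_law, (0.08, 0.2)), (saturating, (0.7, 5000.0))]:
    params, _ = curve_fit(fn, hours, succ, p0=p0, maxfev=10000)
    sse = float(np.sum((succ - fn(hours, *params)) ** 2))
    print(fn.__name__, params, "SSE:", sse)
```

If the power law wins on held-out points, the curve is still in its climbing regime, which is what the monotonic 3,000-to-20,000-hour result indicates.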
Also worth noting: with only 80 demonstrations per task, LingBot-VLA matches the performance that π0.5 needs 130 to reach. That's roughly 60% of the task-specific data, a reduction of almost 40%. This is what data efficiency looks like when the pretraining backbone is strong.
Reuss categorized the ICLR 2026 submissions. The major themes:
Reuss flags something important: there's a structural gap between frontier labs and academic researchers that benchmark numbers hide.
Frontier labs — Physical Intelligence, Figure AI, NVIDIA, now Ant Group — have access to tens of thousands of hours of real robot data, proprietary teleoperation hardware, and full infrastructure stacks for training at scale. Academic labs are working with a fraction of that.
The result: papers with comparable benchmark numbers are often using completely different approaches, and the academic work is frequently trying to close a gap that exists because of data access, not algorithmic innovation.
This matters for how we interpret the field. The 164 ICLR submissions aren't 164 independent shots on goal. Most academic work is constrained to what's possible with open-source data and limited compute. The actual frontier moves when a company with scale decides a problem is worth solving.
Three things I'm taking from this: