Moritz Reuss, in his ICLR 2026 VLA survey, pulled keyword counts from OpenReview:
That kind of growth curve doesn't happen by accident. Something clicked. The transformer architecture worked for language, then for images, then for video — and researchers figured it could work for robot actions too. Not as a simulation of thought, but as a direct mapping from perception to control.
What counts as a VLA sounds like a taxonomy question, but it isn't. The answer determines what you can expect from a model.
Reuss draws a clean line: a Vision-Language-Action model requires internet-scale vision-language pretraining before being fine-tuned on robot actions. Without that pretraining, it's a multimodal policy — different architecture, different generalization guarantees, different failure modes.
The distinction matters because internet-scale VLM pretraining gives VLAs something multimodal policies don't have: the ability to follow complex language instructions in novel contexts, generalize across tasks they've never seen, and reason about spatial relationships from textual descriptions alone.
This is why Physical Intelligence's π0, Figure AI's models, and now Ant Group's LingBot-VLA all follow the same blueprint: start with a pretrained VLM backbone (Qwen, LLaVA, etc.), freeze or partially freeze it, then add an action head trained with robotic demonstration data.
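To make the blueprint concrete, here is a minimal sketch of that recipe in PyTorch. Everything specific is a placeholder: the backbone name, the mean-pooling, and the head sizes are illustrative, not LingBot-VLA's or π0's actual design, and it assumes a HuggingFace-style VLM whose forward pass returns last_hidden_state.

```python
import torch.nn as nn
from transformers import AutoModel

class MinimalVLA(nn.Module):
    def __init__(self, backbone="Qwen/Qwen2.5-VL-3B-Instruct",
                 action_dim=7, chunk_len=16):
        super().__init__()
        # Internet-scale pretrained VLM, frozen: fine-tuning on robot
        # demonstrations only updates the action head.
        self.vlm = AutoModel.from_pretrained(backbone)
        for p in self.vlm.parameters():
            p.requires_grad = False
        # Assumes a flat config; some multimodal configs nest this
        # under text_config.
        hidden = self.vlm.config.hidden_size
        # Action head: pooled VLM features -> a chunk of continuous actions.
        self.action_head = nn.Sequential(
            nn.Linear(hidden, 1024), nn.GELU(),
            nn.Linear(1024, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, **vlm_inputs):
        # Pool the VLM's final hidden states into one conditioning vector.
        feats = self.vlm(**vlm_inputs).last_hidden_state.mean(dim=1)
        return self.action_head(feats).view(-1, self.chunk_len, self.action_dim)
```

Production systems replace the pooled-feature MLP with something heavier (LingBot-VLA uses a Mixture of Transformers action expert, covered below), but the division of labor is the same: the VLM supplies semantics, the head supplies control.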
Ant Group released LingBot-VLA in January 2026. It trained on 20,000 hours of teleoperated dual-arm data across 9 robot embodiments — AgiBot G1, Galaxea R1, Leju KUAVO, and others. The model is evaluated on the GM-100 benchmark, which uses 100 real-world manipulation tasks with 130 teleoperated trajectories per task.
The results:
What makes LingBot-VLA interesting isn't just the benchmark number — it's what drove it. Ant Group built a depth distillation module (LingBot-Depth) that learns to reconstruct dense metric depth from sparse RGB-D inputs. When depth sensors fail or return noisy measurements in regions like corners and edges, the model still has geometric reasoning. This matters for insertion, stacking, and folding tasks where 3D spatial precision is the bottleneck.
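Ant Group hasn't released LingBot-Depth at the level of code, so treat the following as a generic depth-completion sketch of the idea, not their module: a small network takes RGB plus the sensor's sparse, hole-ridden depth channel and is supervised to produce dense metric depth, with the loss applied only where ground truth exists.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthCompletion(nn.Module):
    """Generic sketch: predict dense metric depth from RGB plus a
    sparse/noisy sensor depth channel (4 input channels total)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, rgb, sparse_depth):
        # Concatenate RGB (B,3,H,W) with sensor depth (B,1,H,W).
        return self.net(torch.cat([rgb, sparse_depth], dim=1))

def completion_loss(pred, gt_depth, valid_mask):
    # Supervise only pixels where ground-truth depth exists; the net
    # learns to fill the corners and edges where the sensor fails.
    return F.l1_loss(pred[valid_mask], gt_depth[valid_mask])
```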
The architecture uses Qwen2.5-VL as its vision-language backbone with a Mixture of Transformers action expert. Actions are modeled as a flow matching problem — the model learns a vector field that transports Gaussian noise to a ground truth action trajectory along a linear probability path. This produces continuous, temporally smooth action chunks rather than discrete action tokens.
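That description maps onto the standard conditional flow matching objective. Here is a minimal training-step sketch under the paragraph's own setup (linear path, Gaussian source); velocity_net is a placeholder for the action expert, and the shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, actions, cond):
    # actions: (B, chunk_len, action_dim) ground-truth trajectory chunk
    # cond:    conditioning features from the VLM backbone
    x1 = actions
    x0 = torch.randn_like(x1)                            # Gaussian noise source
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # sample t ~ U[0, 1]
    xt = (1 - t) * x0 + t * x1                           # linear probability path
    target_v = x1 - x0                                   # constant velocity along that path
    pred_v = velocity_net(xt, t.flatten(), cond)         # predict the vector field
    return F.mse_loss(pred_v, target_v)
```

At inference you integrate the learned field, for example a few Euler steps carrying x0 ~ N(0, I) from t = 0 to t = 1, which is what yields the continuous, temporally smooth action chunks rather than a sequence of discrete tokens.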
Most discussion of LingBot-VLA focuses on the benchmark. The more important finding is in the scaling analysis:
Both success rate and progress score increase monotonically with pretraining data volume from 3,000 to 20,000 hours. No saturation. This is the first empirical evidence that VLA models continue benefiting from more real robot data at this scale.
Compare this to language models, where the returns to further scaling have been contested since GPT-4. The fact that LingBot-VLA shows monotonic improvement across this range suggests the data problem in robotics is even more acute than previously thought: more data demonstrably helps, and we are not yet at the ceiling.
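"No saturation" is a checkable claim: fit both a pure power law and a saturating curve to the (hours, success-rate) points and compare residuals. The numbers below are made up purely for illustration; the real points are in the LingBot-VLA report.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (hours, success-rate) points, for illustration only;
# the real GM-100 numbers are in the LingBot-VLA report.
hours = np.array([3000.0, 7000.0, 12000.0, 20000.0])
succ = np.array([0.42, 0.51, 0.57, 0.63])

def power_law(x, a, b):        # keeps climbing: no ceiling term
    return a * x ** b

def saturating(x, s, k):       # flattens out toward a ceiling s
    return s * (1 - np.exp(-x / k))

for fn, p0 in [(power_law, (0.08, 0.2)), (saturating, (0.7, 5000.0))]:
    params, _ = curve_fit(fn, hours, succ, p0=p0, maxfev=10000)
    sse = float(np.sum((succ - fn(hours, *params)) ** 2))
    print(fn.__name__, params, "SSE:", sse)
```

If the power law wins on held-out points, the curve is still in its climbing regime, which is what the monotonic 3,000-to-20,000-hour result indicates.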
Also worth noting: with only 80 demonstrations per task, LingBot-VLA matches the performance that π0.5 needs 130 to reach. That's roughly 60% of the task-specific data, a reduction of almost 40%. This is what data efficiency looks like when the pretraining backbone is strong.
Reuss categorized the ICLR 2026 submissions. The major themes:
Reuss flags something important: there's a structural gap between frontier labs and academic researchers that benchmark numbers hide.
Frontier labs — Physical Intelligence, Figure AI, NVIDIA, now Ant Group — have access to tens of thousands of hours of real robot data, proprietary teleoperation hardware, and full infrastructure stacks for training at scale. Academic labs are working with a fraction of that.
The result: papers with comparable benchmark numbers are often using completely different approaches, and the academic work is frequently trying to close a gap that exists because of data access, not algorithmic innovation.
This matters for how we interpret the field. The 164 ICLR submissions aren't 164 independent shots on goal. Most academic work is constrained to what's possible with open-source data and limited compute. The actual frontier moves when a company with scale decides a problem is worth solving.
Three things I'm taking from this: