Active vision is learnable, not engineered — what TAVIS tells us about the next generation of VLAs
When a robot reaches for an object, where does it look? At the hand? The target? Somewhere in between? The answer depends on whether the robot's vision system is fixed or active — and the difference matters more than the robotics field has been willing to admit.
TAVIS (a benchmark from Giacomo Spigler, published May 2026) is the first systematic study of active vision in imitation learning: whether a robot policy should control where it looks during manipulation, and what happens when it does. The results are clarifying and a little unsettling.
What TAVIS measures
TAVIS has two task suites. TAVIS-Head tests global search: a pan-tilt neck lets the policy move its gaze across a scene before deciding where to reach. TAVIS-Hands tests local occlusion: wrist-mounted cameras let the policy actively look around obstructions rather than relying on a fixed hand camera. Two humanoid embodiments — GR1T2 and Reachy2 — run the experiments in IsaacLab (NVIDIA's robot simulation platform). Diffusion Policy and π0 are the primary baselines tested.
The training data: roughly 2,200 episodes of teleoperated demonstrations. A standard imitation learning setup.
Finding 1: Active vision helps — but conditionally
Active gaze generally outperforms fixed cameras, but the gain is task-dependent. In some tasks the improvement is substantial; in others it's negligible. This matters for hardware design: adding a pan-tilt neck only helps if the tasks your robot will perform actually require it. You can't engineer active vision as a universal upgrade. You have to know your task distribution first.
This is a design constraint, not a defeat. It means the question "should I add active vision?" is really "which tasks am I optimizing for?"
Finding 2: Multi-task policies break under distribution shift
This is the most practically important result, and the most concerning.
Policies trained across multiple tasks degrade sharply when tested on out-of-distribution procedural variations — a different colored object, an unfamiliar background, a novel arrangement. The sim-to-real gap has a cousin here: the multi-task distribution gap. Train on tasks A, B, C; evaluate on task D with a slightly different configuration; performance falls off a cliff.
The mechanism is becoming familiar: VLAs interpolate over their training distribution in ways that aren't visible during in-distribution evaluation. The failure mode is emergent, not systematic, so a standard benchmark won't catch it. Multi-task VLAs may be overfitting to their training task distributions without showing any sign of it until they hit a shifted configuration.
π0 (Physical Intelligence's foundation model) was one of the tested baselines. It doesn't escape this pattern. This matters because π0 is one of the most capable VLAs publicly described. If it degrades under multi-task distribution shift, the implication is that every current VLA does — the problem is architectural, not specific to a particular model.
Finding 3: Anticipatory gaze emerges from imitation alone
This is the most striking result, and the one that says the most about what imitation learning is actually learning.
Policies trained purely by imitation learning develop anticipatory gaze — they look at where they're about to reach before reaching. No explicit training for this behavior. It emerges from the demonstration data alone. The median lead time (how early the policy redirects gaze before an action) is comparable to the human teleoperator who generated the demonstrations.
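The lead-time statistic itself is simple to compute from logged episodes. A minimal sketch, assuming hypothetical per-episode annotations of the frame where gaze first fixates the target and the frame where the reach begins (all names and values here are illustrative, not from the TAVIS code):

```python
import numpy as np

def median_gaze_lead(gaze_frames, reach_frames, dt=0.05):
    """Median lead time in seconds: how long gaze fixates the target
    before the reach begins. Positive values mean anticipatory gaze."""
    leads = [(r - g) * dt for g, r in zip(gaze_frames, reach_frames)]
    return float(np.median(leads))

# Hypothetical per-episode event frames: first fixation on the target
# vs. onset of hand motion toward it (at 20 Hz, dt = 0.05 s per frame).
gaze_frames = [12, 30, 8, 22]
reach_frames = [20, 41, 15, 30]
print(median_gaze_lead(gaze_frames, reach_frames))  # 0.4 s of anticipation
```

The same statistic computed over the human teleoperator's gaze traces gives the baseline the policy is compared against.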
This means imitation learning is not just copying actions. It's encoding something closer to a world model — object permanence, object relationships, anticipatory affordances. The policy isn't learning "when I see X, move to Y." It's learning something about how the world unfolds.
Whether that "something" generalizes is the open question.
What this means for VLA architecture decisions
Three implications for people building or specifying VLA systems:
Active vision should be a learned capability, not an engineered one. The TAVIS results suggest the field's instinct to hard-code gaze mechanisms is backwards. If anticipatory gaze emerges from imitation, the right move is to give the policy control of its sensors and let the demonstrations teach gaze. Engineering gaze by hand forecloses the possibility of discovering task-optimal gaze strategies you didn't think of.
Cross-task generalization needs explicit testing against procedural variation, not just benchmark averages. The multi-task degradation result shows that reporting average performance across tasks is insufficient. The metric that matters is performance under distribution shift, and most current benchmarks don't measure it.
Hardware design for VLAs must be task-informed. Adding a pan-tilt neck or wrist cameras isn't automatically good. The benefit is task-conditional. If your deployment tasks mostly involve stable geometries, known objects, and constrained environments, the cost of active vision hardware may not be justified by the performance gain.
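The second implication can be made concrete as an evaluation harness. A minimal sketch, with entirely hypothetical procedural factors and function names, whose point is to report success per condition rather than a single averaged score, so a distribution-shift failure can't hide behind the mean:

```python
import itertools

# Hypothetical procedural factors for an out-of-distribution sweep.
COLORS = ["red", "green", "blue"]
BACKGROUNDS = ["wood", "cloth"]
LAYOUTS = ["near", "far"]

def eval_policy(policy, task, color, background, layout, n_episodes=20):
    """Placeholder rollout loop: success rate over n_episodes."""
    successes = sum(policy(task, color, background, layout)
                    for _ in range(n_episodes))
    return successes / n_episodes

def ood_report(policy, held_out_task):
    """One success rate per procedural condition, not one average."""
    return {
        (c, b, l): eval_policy(policy, held_out_task, c, b, l)
        for c, b, l in itertools.product(COLORS, BACKGROUNDS, LAYOUTS)
    }
```

A policy that averages 80% but scores near zero on one (color, background, layout) cell is exactly the failure mode the TAVIS multi-task result describes.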
What this means for the experiment loop
Anticipatory gaze is testable on Shrike Lite with a pan-tilt camera and a short imitation learning experiment. The setup: record demonstrations of reach-and-pick tasks, train a simple policy (Diffusion Policy or even a small MLP), and observe whether anticipatory gaze behavior emerges. If it does, we have a local replication of a frontier finding on a 2-DOF arm and a Raspberry Pi. If it doesn't, that's also informative: the emergent behavior may require a more complex embodiment or a higher-dimensional action space to manifest.
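The simplest end of that spectrum can be sketched as behavior cloning via linear least squares over stand-in data, with the pan-tilt gaze commands treated as two more action dimensions alongside the arm joints. Everything here (feature dimension, action layout, the synthetic data) is an illustrative assumption, not the TAVIS setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged demonstrations: image features -> teleop actions,
# where the action vector is [arm joint targets (2) | pan-tilt gaze (2)].
obs = rng.normal(size=(500, 16))
true_map = rng.normal(size=(16, 4))
actions = obs @ true_map + 0.01 * rng.normal(size=(500, 4))

# Behavior cloning as least squares: the gaze columns are fit exactly
# like the arm columns, so gaze is learned from demos, not engineered.
W, *_ = np.linalg.lstsq(obs, actions, rcond=None)

pred = obs @ W
arm_mse = float(np.mean((pred[:, :2] - actions[:, :2]) ** 2))
gaze_mse = float(np.mean((pred[:, 2:] - actions[:, 2:]) ** 2))
```

On real Shrike Lite data a small MLP would replace the linear map, but the structural point carries over: the gaze head is trained by the same loss as the arm head.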
The multi-task degradation test is harder to run on Shrike Lite (requires a multi-task benchmark) but the question is well-defined: can a policy trained on task set {A, B, C} maintain performance when tested on task D with procedural variation? This is the kind of question worth designing EXP-003 around.
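That question implies a leave-one-task-out protocol: train on every subset of three tasks, evaluate on the fourth under procedural variation. A minimal sketch, with a hypothetical task set:

```python
TASKS = ["A", "B", "C", "D"]  # hypothetical task set for EXP-003

def leave_one_out_splits(tasks):
    """All (train tasks, held-out task) pairs: train on three tasks,
    evaluate the policy on the fourth under procedural variation."""
    return [([t for t in tasks if t != held], held) for held in tasks]

for train, held in leave_one_out_splits(TASKS):
    print(train, "->", held)
```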
The TAVIS code and evaluation protocol are on GitHub. Running a local replication with Shrike Lite would be the most direct contribution — a small, grounded experiment that adds a data point to a field that mostly publishes simulation-only results.