VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

Han Dong; Haoyuan Shi; Haozhe Shan; Jiayu Hu; Jinpeng Lu; Qinfan Zhang; Xiancong Ren; Xiaozhu Ju; Yinda Chen; Yingji Zhang; Yi Zhang; Yong Dai

arxiv: 2605.30117 · v1 · pith:CMWVBVGAnew · submitted 2026-05-28 · 💻 cs.AI

Pith reviewed 2026-06-29 07:42 UTC · model grok-4.3

classification 💻 cs.AI

keywords: vision-language-action models, model diagnosis, representation tracing, attention interventions, embodied control, multimodal routing, semantic following, action generation

The pith

VLA-Trace shows two vision-language-action models adapt modalities differently, route inputs variably across layers, and generate visually grounded actions but fall short on fine semantic instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VLA-Trace to diagnose how vision-language-action models turn multimodal inputs into embodied actions by building an evidence chain from representation changes during training to causal control pathways to observable behavior. It applies cross-modal and drift measures to track internal evolution, targeted interventions to isolate which connections drive outputs, and task rollouts to check grounding versus shortcut use. On the two models tested, this reveals each one changes its vision and language handling in its own pattern, combines the inputs through different layer strategies, and produces reliable visual trajectories while often missing precise language details. A sympathetic reader would care because these patterns identify concrete places where current models lose or ignore information before it reaches control.

Core claim

VLA-Trace establishes that the examined models exhibit distinct modality-specific adaptation dynamics during finetuning, rely on different multimodal routing strategies and layer-wise dependencies during action decoding, and excel at visually grounded trajectory generation while remaining limited in fine-grained semantic following.

What carries the argument

VLA-Trace, a progressive diagnostic framework that chains representation evolution tracking with attention knockout for causal pathway identification and rollout probes for behavioral assessment of grounding and semantic dependence.
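The attention-knockout step can be illustrated on a toy attention layer. The sketch below is not the paper's implementation, which this page does not show. It assumes knockout means blocking attention from a chosen set of query tokens (here, action tokens) to a chosen set of key tokens (here, image tokens) by masking the logits before the softmax. The function names, the token layout, and the single-head setup are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; -inf logits become exactly zero weight."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, blocked=None):
    """Single-head scaled dot-product attention.

    blocked: optional boolean [n_query, n_key] array. True entries are
    knocked out, i.e. that query cannot attend to that key.
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if blocked is not None:
        logits = np.where(blocked, -np.inf, logits)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

def knockout_mask(n_tokens, query_idx, key_idx):
    """Boolean mask that blocks every (query, key) pair in the given index sets."""
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    mask[np.ix_(query_idx, key_idx)] = True
    return mask

# Hypothetical token layout: 4 image tokens, 3 text tokens, 2 action tokens.
image_idx, text_idx, action_idx = [0, 1, 2, 3], [4, 5, 6], [7, 8]
rng = np.random.default_rng(0)
x = rng.normal(size=(9, 16))  # toy hidden states, used as q, k and v

blocked = knockout_mask(9, action_idx, image_idx)
out_base, w_base = attention(x, x, x)
out_ko, w_ko = attention(x, x, x, blocked)
```

Because the mask is applied before the softmax, the blocked weight is redistributed over the remaining keys. That redistribution is one reason a behavioral change after knockout does not by itself prove that the blocked pathway was causally necessary.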

If this is right

Adaptation procedures during finetuning should be adjusted to preserve useful modality-specific representations rather than overwrite them uniformly.
Explicit design of multimodal routing circuits could reduce unwanted layer-wise dependencies in action decoding.
Policy training objectives need additions that enforce finer semantic adherence beyond visual trajectory matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tracing steps could be applied to other sequential or decision-making multimodal systems to check for similar adaptation patterns.
The observed semantic shortfall may stem from training data imbalances that favor visual signals over language precision, a factor that targeted data changes could test.
Expanding the behavioral probes to longer-horizon or more cluttered environments would reveal whether the visual grounding advantage holds outside the current test setups.

Load-bearing premise

The attention knockout interventions and behavioral probes isolate the models' true causal control pathways and semantic dependencies without introducing confounding artifacts from the intervention methods or rollout environments.

What would settle it

Re-running the attention knockouts on the identified pathways produces no change in the generated actions, or the models achieve comparable performance on fine-grained semantic following tasks as on visual trajectory tasks under matched conditions.

Figures

Figures reproduced from arXiv: 2605.30117 by Han Dong, Haoyuan Shi, Haozhe Shan, Jiayu Hu, Jinpeng Lu, Qinfan Zhang, Xiancong Ren, Xiaozhu Ju, Yinda Chen, Yingji Zhang, Yi Zhang, Yong Dai.

**Figure 2.** Layer-wise image–text CKA across datasets and training stages. The panels compare cross-modal …

**Figure 3.** Stage-wise checkpoint-drift CKA for π0.5 and OpenVLA. Each cell reports the mean matched-layer CKA for vision, text, and joint representations.

… intervening at these two stages, we can distinguish whether performance gains arise from better multimodal understanding or more effective utilization of modality-specific information during generation. Three attention configurations are considered: i. Baseline, whe…

**Figure 4.** Attention knockout for π0.5 (top) and OpenVLA (bottom) on LIBERO-10. Similar observations for LIBERO-Goal, Object, and Spatial are provided in …

**Figure 5.** π0.5 attention IoU and mass on LIBERO-10. (a) Object IoU dynamically shifts between the first (Phase 1) and second (Phase 2) instruction subgoals. (b) Attention mass allocation over robot and object regions. These results indicate that VLA policies successfully generate visually grounded trajectories by tracking task-relevant objects over time. See similar observations in …

**Figure 6.** Layer-wise knockout results for π0.5 on LIBERO-10, Goal, Object, and Spatial. Each point reports the success rate under a 3-layer knockout window centered at the indicated layer. [Plot: success rate (%) against layer (ticks 0 to 17), panels a to h, each with a base reference.]

**Figure 7.** Layer-wise knockout results for π0.5 on LIBERO-10, Goal, Object, and Spatial. Each point reports the success rate under a 5-layer knockout window centered at the indicated layer. [Plot: success rate (%) against layer (ticks 0 to 17), panels a to h, each with a base reference.]

**Figure 8.** Layer-wise knockout results for π0.5 on LIBERO-10, Goal, Object, and Spatial. Each point reports the success rate under a 7-layer knockout window centered at the indicated layer.

**Figure 9.** Layer-wise knockout results for …

**Figure 10.** Layer-wise knockout results for OpenVLA on LIBERO-10, Goal, Object, and Spatial. Each point reports the success rate under a 7-layer knockout window centered at the indicated layer. [Plot: success rate (%) against layer (ticks 0 to 31), panels a to h, each with a base reference.]

**Figure 11.** Layer-wise knockout results for OpenVLA-OFT on LIBERO-10, Goal, Object, and Spatial. Each point reports the success rate under a 3-layer knockout window centered at the indicated layer.

**Figure 12.** Layer-wise knockout results for OpenVLA-OFT on LIBERO-10, Goal, Object, and Spatial. Each point reports the success rate under a 5-layer knockout window centered at the indicated layer. [Plot: success rate (%) against layer (ticks 0 to 31), panels a to h, each with a base reference.]

**Figure 13.** Layer-wise knockout results for OpenVLA-OFT on LIBERO-10, Goal, Object, and Spatial. Each point reports the success rate under a 7-layer knockout window centered at the indicated layer.

**Figure 14.** Layer-wise knockout results for OpenVLA-OFT on RoboTwin tasks. Each row corresponds to one RoboTwin task, and each point reports the success rate under a 1-layer knockout window at the indicated layer. [Plot: success rate (%) against layer (ticks 0 to 31), panels a to h, each with a base reference; the first row is Click Alarm Clock.]

**Figure 15.** Layer-wise knockout results for OpenVLA-OFT on RoboTwin tasks. Each row corresponds to one RoboTwin task, and each point reports the success rate under a 3-layer knockout window at the indicated layer.

**Figure 16.** Schematic illustration of attention-IoU computation. We project action-conditioned attention onto the …

**Figure 17.** Qualitative action-to-image attention visualizations across rollout stages. We compare pretrained and …

**Figure 18.** Token-wise text-to-image attention for pretrained and fine-tuned …

**Figure 19.** Token-wise text-to-image attention for pretrained and fine-tuned OpenVLA across execution steps.

**Figure 20.** Visualization of action-to-text attention at different timesteps (top to bottom: steps 0, 30, 60, 100, and …

**Figure 21.** Visualization of layer-wise modality attention (left: step 30, right: step 150) (Top: pretrained, bottom: …

**Figure 22.** OpenVLA attention IoU and mass on LIBERO-10. (a) Object IoU dynamically shifts between the …
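The attention IoU and attention mass quantities named in Figures 5, 16, and 22 can be sketched as follows. The captions on this page are truncated, so the exact procedure is not recoverable here. This sketch assumes the attention has already been projected onto a 2-D patch grid, that "mass" is the share of normalized attention falling inside a region mask, and that IoU is computed after keeping the top fraction of patches. The threshold rule and all names are assumptions.

```python
import numpy as np

def attention_mass(attn, region):
    """Share of total attention that falls inside a boolean region mask."""
    attn = attn / attn.sum()
    return float(attn[region].sum())

def attention_iou(attn, region, top_fraction=0.25):
    """IoU between the most-attended patches and a boolean region mask.

    The attended set is the top `top_fraction` of patches by attention
    weight (an assumed binarization rule; ties at the threshold are kept).
    """
    k = max(1, int(round(top_fraction * attn.size)))
    threshold = np.sort(attn, axis=None)[-k]
    attended = attn >= threshold
    intersection = np.logical_and(attended, region).sum()
    union = np.logical_or(attended, region).sum()
    return float(intersection / union) if union > 0 else 0.0

# Toy 4x4 patch grid with attention concentrated on the top-left object.
attn = np.full((4, 4), 0.01)
attn[:2, :2] = 0.21

object_a = np.zeros((4, 4), dtype=bool)
object_a[:2, :2] = True   # region the toy policy is looking at
object_b = np.zeros((4, 4), dtype=bool)
object_b[2:, 2:] = True   # region it is ignoring

iou_a, mass_a = attention_iou(attn, object_a), attention_mass(attn, object_a)
iou_b, mass_b = attention_iou(attn, object_b), attention_mass(attn, object_b)
```

Tracking both values per rollout step for each subgoal object would give curves of the kind the Figure 5 caption describes, where object IoU shifts from the first subgoal's object to the second.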

Original abstract

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $\pi_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.
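For readers unfamiliar with CKA, here is a minimal sketch of linear centered kernel alignment. The abstract does not say which kernel or pooling the authors use, so the linear variant and the per-sample feature matrices are assumptions. For cross-modal CKA the two inputs would be image and text features of the same samples at one layer; for checkpoint-drift CKA they would be the same layer's features at two checkpoints.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between x [n, d1] and y [n, d2] computed on the same n samples.

    Returns a value in [0, 1]; 1 means the representations match up to
    rotation and isotropic scaling.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
feats_pre = rng.normal(size=(64, 32))             # toy "pretrained" features
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
feats_rotated = 2.0 * feats_pre @ rotation        # same geometry, new basis
feats_unrelated = rng.normal(size=(64, 48))       # independent features

cka_same = linear_cka(feats_pre, feats_rotated)
cka_unrelated = linear_cka(feats_pre, feats_unrelated)
```

In this toy setup the rotated and rescaled copy scores essentially 1 while the unrelated features score lower. Note that with few samples relative to feature width, even independent features do not score near 0, so raw CKA values are best compared against a matched baseline rather than read in isolation.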

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VLA-Trace gives a concrete diagnostic toolkit for VLA models that surfaces real differences between π0.5 and OpenVLA, but the knockout results rest on interventions whose causal isolation is not yet demonstrated.

The paper's main contribution is VLA-Trace, a framework that chains cross-modal CKA for tracking representation shifts, attention knockouts for routing claims, and rollout probes for behavioral limits. Applied to π0.5 and OpenVLA, it reports three observations: the models adapt to vision and language at different rates during finetuning, they show distinct layer-wise dependencies when producing actions, and they generate visually grounded paths more reliably than they follow fine-grained language instructions.

The combination of these three tracing steps on current VLA checkpoints is new enough to be useful. It moves past single-metric evaluations and gives practitioners specific places to look when a model fails on a robot task. The reported distinctions between the two models are the kind of empirical anchor that can guide later architecture choices.

The weakest part is the attention-knockout section. Zeroing out modality-specific heads can reduce overall capacity or trigger compensatory rerouting, and the abstract gives no sign of matched random-knockout or noise controls that would separate those effects from true causal pathways. The semantic-following limitation could also be tied to the particular environments or trajectory lengths chosen rather than a general model property. Without those checks the routing claims stay suggestive rather than definitive.
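A matched random-knockout control of the kind asked for above could be set up roughly as follows. This is a sketch under stated assumptions, not something the paper reports: heads are identified by hypothetical (layer, head) pairs, controls match the targeted set's per-layer head count, and `toy_success_rate` is a made-up stand-in for a real rollout evaluation.

```python
import random

def sample_matched_controls(targeted, n_heads, n_sets, seed=0):
    """Sample random head sets matching the per-layer head count of `targeted`.

    Targeted heads themselves are excluded from every control set.
    """
    rng = random.Random(seed)
    per_layer = {}
    for layer, head in sorted(targeted):
        per_layer.setdefault(layer, set()).add(head)
    controls = []
    for _ in range(n_sets):
        chosen = set()
        for layer, heads in per_layer.items():
            pool = [h for h in range(n_heads) if h not in heads]
            if len(pool) < len(heads):
                raise ValueError("not enough untargeted heads in layer %d" % layer)
            chosen.update((layer, h) for h in rng.sample(pool, len(heads)))
        controls.append(chosen)
    return controls

def specificity(targeted_drop, control_drops):
    """Fraction of controls whose success-rate drop is at least the targeted drop."""
    return sum(d >= targeted_drop for d in control_drops) / len(control_drops)

# Hypothetical targeted heads and a toy evaluation (NOT a real rollout).
targeted = {(3, 0), (3, 5), (10, 2)}

def toy_success_rate(knocked):
    hits = len(knocked & targeted)
    misses = len(knocked - targeted)
    return 0.9 - 0.2 * hits - 0.01 * misses

controls = sample_matched_controls(targeted, n_heads=8, n_sets=20)
base = toy_success_rate(set())
targeted_drop = base - toy_success_rate(targeted)
control_drops = [base - toy_success_rate(c) for c in controls]
score = specificity(targeted_drop, control_drops)
```

A score near 0 means random knockouts of the same size rarely hurt as much as the targeted one, which supports a pathway-specific reading. A score near 1 would point to generic capacity loss instead.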

This is work for people already training or debugging VLA policies who need better internal visibility. It is not a finished theory of multimodal control, but the measurements are reproducible enough to be worth checking. A serious editor should send it to review so the intervention controls and rollout choices can be tightened.

Referee Report

3 major / 1 minor

Summary. The paper introduces VLA-Trace, a diagnostic framework for Vision-Language-Action (VLA) models that combines cross-modal and checkpoint-drift CKA to trace representation evolution during finetuning, attention knockout interventions to attribute modality-specific control pathways during action decoding, and rollout behavioral probes to assess visual grounding, shortcut dependence, and semantic following. Experiments on π0.5 and OpenVLA yield three findings: distinct modality-specific adaptation dynamics, different multimodal routing strategies with layer-wise dependencies, and strong visual trajectory generation but limited fine-grained semantic following.

Significance. If the causal attributions from the interventions hold after appropriate controls, the work would supply concrete empirical evidence on how VLA models integrate vision, language, and action representations, directly informing representation-preserving adaptation methods and compositional semantic control. The unified evidence chain from representations to behavior is a methodological contribution that could be extended to other multimodal embodied models.

major comments (3)

[Attention knockout interventions] Attention knockout section: the interventions that zero modality-specific attention to identify routing strategies and layer-wise dependencies lack reported controls such as random-head knockouts of matched size, magnitude-matched noise injection, or attention redistribution measurements; without these, observed behavioral deltas may reflect generic capacity loss rather than causal modality attribution.
[Rollout-level behavioral probes] Behavioral probes and rollout experiments: the claim that VLA policies are limited in fine-grained semantic following rests on rollout-level probes whose environment selection, trajectory length distribution, and potential shortcut biases are not detailed with quantitative metrics or ablation; this leaves open whether the limitation is model-intrinsic or task/environment-specific.
[Abstract and experimental results] Abstract and results summary: the three key findings are stated without accompanying quantitative values, error bars, statistical tests, or baseline comparisons for the CKA alignments, knockout deltas, or probe success rates, preventing assessment of effect sizes and robustness.
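One simple way to attach uncertainty to rollout success rates, as the third comment requests, is a Wilson score interval per condition. The sketch below is a generic statistical illustration; the counts are invented for the example and are not results from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half

# Made-up counts: 45/50 successful rollouts at baseline, 30/50 under knockout.
base_lo, base_hi = wilson_interval(45, 50)
ko_lo, ko_hi = wilson_interval(30, 50)

# Non-overlapping intervals are a conservative sign that the drop is real.
intervals_separate = ko_hi < base_lo
```

Reporting intervals like these next to each knockout delta would let a reader judge whether a given layer's drop exceeds rollout noise at the chosen number of episodes.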

minor comments (1)

[Abstract] Notation for the model π0.5 is rendered as $\pi_{0.5}$ in the abstract; consistent use of the intended symbol throughout would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below, agreeing where revisions are warranted and providing clarifications where the existing analysis already supports the claims. We will incorporate the suggested controls, details, and metrics in the revised manuscript.

Referee: [Attention knockout interventions] Attention knockout section: the interventions that zero modality-specific attention to identify routing strategies and layer-wise dependencies lack reported controls such as random-head knockouts of matched size, magnitude-matched noise injection, or attention redistribution measurements; without these, observed behavioral deltas may reflect generic capacity loss rather than causal modality attribution.

Authors: We agree that additional controls would strengthen the causal claims. In the revised version we will add random-head knockouts of matched size, magnitude-matched noise injection, and attention redistribution measurements to demonstrate that the observed behavioral changes are attributable to modality-specific routing rather than generic capacity loss. revision: yes
Referee: [Rollout-level behavioral probes] Behavioral probes and rollout experiments: the claim that VLA policies are limited in fine-grained semantic following rests on rollout-level probes whose environment selection, trajectory length distribution, and potential shortcut biases are not detailed with quantitative metrics or ablation; this leaves open whether the limitation is model-intrinsic or task/environment-specific.

Authors: We will expand the behavioral probes section to report quantitative metrics on environment selection, trajectory length distributions, and ablations addressing potential shortcut biases. These additions will help establish whether the observed limitations in fine-grained semantic following are intrinsic to the models or influenced by specific task and environment factors. revision: yes
Referee: [Abstract and experimental results] Abstract and results summary: the three key findings are stated without accompanying quantitative values, error bars, statistical tests, or baseline comparisons for the CKA alignments, knockout deltas, or probe success rates, preventing assessment of effect sizes and robustness.

Authors: We will revise the abstract and results sections to include key quantitative values, error bars, and statistical tests for the main CKA, knockout, and probe results. Due to abstract length constraints, we will prioritize the most salient metrics while ensuring baseline comparisons and full statistics appear prominently in the main text and supplementary material. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurements from independent interventions

The paper presents an empirical diagnostic framework using CKA for representation tracing, attention knockout for causal attribution, and rollout probes for behavior. These are applied as external measurement tools to existing models (π0.5, OpenVLA); the reported findings are direct outcomes of those measurements rather than quantities defined in terms of the measurements themselves. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on observable deltas from interventions, which are falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on standard assumptions from representation similarity and causal intervention literature rather than new axioms or entities introduced in the paper.
