arxiv: 2607.02322 · v1 · pith:6C7RA3V6new · submitted 2026-07-02 · 💻 cs.RO · cs.CV

The Moving Eye: Enhancing VLA Spatial Generalization via Hybrid Dynamic Data Collection

Pith reviewed 2026-07-03 11:11 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords Vision-Language-Action models · spatial generalization · shortcut learning · data collection · robotic manipulation · camera motion · dual-arm setup

The pith

Mixing moving and static camera views during data collection reduces spurious correlations and improves VLA generalization to unseen poses, where simply adding more fixed views fails.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models often fail at spatial generalization because they latch onto shortcuts like fixed camera-object relations instead of learning actual spatial structure. The paper shows that simply collecting more static viewpoints does not overcome this shortcut learning. A dual-arm robot setup, with one arm moving the camera continuously while also using diverse static positions, produces training data that cuts those correlations. This hybrid pattern improves performance on new camera angles and object arrangements and works across several model types.

Core claim

The authors establish that a hybrid data distribution mixing continuous camera motion with diverse static viewpoints substantially reduces the model's reliance on spurious correlations between camera position and task elements. This enables VLAs to generalize effectively to novel camera poses and object configurations, a capability not achieved by multi-fixed viewpoint strategies. The benefit holds across ACT, Diffusion Policy, and VLA models such as Pi0 and Gr00t.

What carries the argument

The hybrid dynamic data collection strategy using a mobile environmental camera arm in a dual-arm setup to generate mixed moving and static view distributions.

If this is right

  • VLAs trained on hybrid data generalize to unseen camera poses and object configurations.
  • Shortcut learning susceptibility is reduced while training stability is maintained.
  • The improvement holds across all evaluated architectures: ACT, Diffusion Policy, and the VLA models Pi0 and Gr00t.
  • Adding more static viewpoints alone does not achieve the same generalization gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Robotic data collection could use low-cost moving cameras to gain robustness without collecting far more total views.
  • The hybrid motion strategy might apply to other tasks needing viewpoint-invariant perception beyond manipulation.
  • Similar dynamic sampling could be tested for improving sim-to-real transfer on spatial reasoning problems.

Load-bearing premise

The dual-arm hardware and camera motion itself do not introduce new confounding factors or spurious correlations that could explain the observed generalization gains rather than the intended reduction in shortcut learning.

What would settle it

A test showing no generalization gain on unseen poses when the same viewpoint positions are captured statically instead of via continuous motion would falsify the claim that motion is necessary to break the shortcuts.
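One way to build the static control described above is sketched here, as an illustration rather than the authors' procedure. It draws static camera poses from the poses actually visited by moving-camera trajectories, so the control covers the same viewpoint positions without intra-episode motion. The function name `matched_static_poses`, the pose tuple format, and the toy values are all assumptions made for this example.

```python
import random


def matched_static_poses(trajectories, n_static, seed=0):
    """Sample static camera poses from poses visited by moving-camera runs.

    trajectories: list of trajectories, each a list of pose tuples
                  (hypothetical format, e.g. (x, y, z, yaw_deg)).
    Returns n_static distinct poses drawn from the set of visited poses,
    so the static control shares the moving data's viewpoint support.
    """
    visited = [pose for traj in trajectories for pose in traj]
    # Drop exact duplicates while preserving order, so each static pose is unique.
    unique = list(dict.fromkeys(visited))
    if n_static > len(unique):
        raise ValueError("not enough distinct poses to sample from")
    return random.Random(seed).sample(unique, n_static)


# Toy example: two short camera trajectories (values are made up).
trajectories = [
    [(0.0, 0.0, 0.5, 0.0), (0.1, 0.0, 0.5, 5.0), (0.2, 0.0, 0.5, 10.0)],
    [(0.0, 0.2, 0.6, -5.0), (0.1, 0.2, 0.6, 0.0)],
]
static_poses = matched_static_poses(trajectories, n_static=3)
```

Sampling uniformly over distinct poses matches the support of the moving data but not necessarily how long the camera dwells at each pose; a stricter control would weight poses by dwell time.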

Figures

Figures reproduced from arXiv: 2607.02322 by Jiang-Jiang Liu, Jiaxing Zhang, Jincheng Tang, Yilong Zhu, Zhengyuan Xie.

Figure 1: Hierarchical Spatial Zoning for Data Collection. We conceptualize the robot’s camera workspace into three configurations: Fixed View, where the camera is static; Multi-Fixed View, where the camera is static within an episode but varies across episodes; and Moving View, the decoupling zone where the camera moves continuously to break spurious correlations. Our key insight is that low-cost decoupled data fr…
Figure 3
  • Fixed View: The camera remains at a single static pose throughout all episodes. This represents the standard, constrained setting.
  • Multi-Fixed View: The camera pose is static within each episode but varies discretely across episodes within a bounded region. This introduces viewpoint diversity without intra-episode motion.
  • Moving View: The camera moves continuously along trajectories within a bounded …
Figure 2: System pipeline. (1) Dual-arm: The manipulation arm (with wrist camera) executes actions; the environmental camera arm observes the scene from varying viewpoints. (2) Hybrid Dynamic Data Collection: Multi-Fixed View data provide stability for convergence, Moving View data provide decoupling to break spurious correlations; they are mixed at ratio Moving:Multi-Fixed = 1 : k (e.g., k = 3 for the Golden Ratio)…
Figure 3: Real-world Realization of Camera Viewpoint Configurations. We implement Fixed, Multi-Fixed, and Moving Views using an environmental camera controlled by a separate robot arm. The Moving View configuration (Decoupling Zone) allows for continuous trajectory sampling to break spurious correlations. 1) Baseline (Fixed View): Standard practice with a single static camera pose. 2) Baseline (Multi-Fixed View): D…
Figure 4: Experimental Setup. A robot arm controls a mobile environmental eye capturing data from various viewpoints, while another one with a wrist-mounted camera performs manipulation tasks. … (1 pen per try). For Multi-Task, each group consists of 25 tries (5 tries per object type). Success rate computation: Reported success rates are computed over 400 evaluation episodes for Pen tasks and over 100 episodes for Mul…
Figure 5: Shortcut learning (Camera-Base). While the Fixed View Baseline collapses on Moving Tests (OOD), our method maintains high performance. Error bars denote standard deviation. Here, ID = same fixed camera and holder as training; OOD = moving camera (Exp.1 convention).
Table II: Breaking the shortcut. Success rates (%) at 2400 samples. The gap between ID and OOD reveals shortcut learning. Column headers: Method, ID-Test (Fixed)…
Figure 6: Verification of Object-Position Coupling. The significant drop in performance when the pen holder is shifted (OOD) demonstrates that the model relies on fixed relative positions rather than true spatial understanding. Our method (Green) successfully breaks this coupling.
Table III: Object-position coupling verification. Success rates (%) showing the impact of shifting the object configuration. While the bas…
Figure 9: Universality of the Data Strategy. All evaluated architectures (ACT, Diffusion, and VLA models including Pi0 and Gr00t) benefit from the mixed data strategy, with Gr00t showing the highest peak performance. Shaded regions denote standard deviation. 2) Universality of the Data Strategy (Pen Pick-and-Place): We revisit the Pen Pick-and-Place task to evaluate different architectures, including ACT [26], Diffu…
Original abstract

Vision-Language-Action (VLA) models have shown remarkable promise in generalized robotic manipulation. However, their spatial generalization remains fragile. We argue that simply increasing the number of viewpoints is insufficient. Models often fall into the trap of Shortcut Learning, latching onto spurious correlations (e.g., fixed relative poses between objects or between the camera and robot base) rather than learning true spatial relationships. In this work, we propose a data-centric solution to enhance VLA spatial generalization. We utilize a dual-arm setup where one arm performs manipulation while the other serves as a mobile environmental camera. We systematically evaluate three data distribution patterns: Fixed, Multi-Fixed, and Moving Views. Our findings reveal that a hybrid strategy, combining continuous camera motion with diverse static viewpoints, yields the best performance by substantially reducing spurious correlations while maintaining training stability. Our experiments demonstrate that this strategy mitigates spurious correlations, enabling VLAs to generalize to unseen camera poses and object configurations where simply adding more static viewpoints fails. Crucially, we reveal that the susceptibility to shortcut learning and the struggle with spatial generalization are universal characteristics shared across diverse architectures. Consequently, all evaluated models (ACT, Diffusion, and VLA models including Pi0 and Gr00t) benefit significantly from our mixed data strategy.
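The Figure 2 caption states that Moving View and Multi-Fixed View data are mixed at a ratio Moving:Multi-Fixed = 1 : k, with k = 3 given as an example. A minimal sketch of how such a mix could be assembled follows. The function name `mix_episodes`, the episode representation, and the choice to subsample both pools are assumptions for illustration; the paper's actual sampling code is not shown on this page.

```python
import random


def mix_episodes(moving, multi_fixed, k=3, seed=0):
    """Build a training set with Moving:Multi-Fixed = 1:k.

    moving, multi_fixed: lists of episodes (any objects; hypothetical format).
    Uses as many Moving episodes as the ratio allows given both pool sizes.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    n_moving = min(len(moving), len(multi_fixed) // k)
    if n_moving == 0:
        raise ValueError("not enough episodes to form a 1:k mix")
    rng = random.Random(seed)
    mixed = rng.sample(moving, n_moving) + rng.sample(multi_fixed, n_moving * k)
    rng.shuffle(mixed)  # interleave the two sources for training
    return mixed


# Toy pools: episodes are just (source, index) tags here.
moving_pool = [("moving", i) for i in range(10)]
multi_fixed_pool = [("multi_fixed", i) for i in range(12)]
train_set = mix_episodes(moving_pool, multi_fixed_pool, k=3)
```

With these toy pools the Multi-Fixed pool is the limiting one, so some Moving episodes go unused; a different design could instead oversample the smaller pool.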

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that Vision-Language-Action (VLA) models suffer from shortcut learning via spurious correlations (e.g., fixed camera-base or object poses) and that simply adding static viewpoints is insufficient. It proposes a dual-arm data collection setup in which one arm manipulates while the second provides camera motion, systematically comparing Fixed, Multi-Fixed, and Moving Views patterns. The central result is that a hybrid strategy (continuous motion plus diverse static viewpoints) yields the best spatial generalization to unseen camera poses and object configurations, with the benefit appearing universal across ACT, Diffusion Policy, Pi0, and Gr00t.

Significance. If the attribution to shortcut mitigation holds, the work would be significant for robotics: it supplies a practical, hardware-leveraged data-collection recipe that improves generalization without architectural changes and demonstrates that shortcut learning is architecture-agnostic. The dual-arm mobile-camera idea is a concrete contribution to data-centric robotics.

major comments (2)
  1. [§4] §4 (Experimental Setup and Data Collection Strategies): the claim that gains in the hybrid Moving+Multi-Fixed condition arise specifically from reduced spurious correlations is underdetermined. All conditions share the same dual-arm platform; continuous motion necessarily couples camera trajectories to second-arm joint states, potential self-occlusions, and altered end-effector dynamics. No ablation is reported that holds viewpoint statistics fixed while removing motion (or vice versa), so alternative explanations for the observed generalization differences cannot be ruled out.
  2. [Abstract and §4] Abstract and §4 (Results): the manuscript states that systematic comparisons were performed and that the hybrid strategy "substantially reduc[es] spurious correlations," yet no quantitative metrics, error bars, statistical tests, or explicit measurements of spurious correlations (e.g., pose-distribution statistics or shortcut probes) are described. Without these, the magnitude, reliability, and cross-model universality of the claimed benefit cannot be verified.
minor comments (1)
  1. [§3] Clarify the exact sampling procedure and viewpoint distribution statistics for each of the three patterns (Fixed, Multi-Fixed, hybrid) so that readers can assess how closely the Multi-Fixed baseline matches the pose marginals induced by continuous motion.
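The viewpoint-distribution comparison requested in the minor comment could be reported with a simple overlap statistic. The sketch below computes a histogram intersection between two sets of one-dimensional camera-pose samples (for example, yaw angles). The function name `histogram_overlap`, the choice of yaw as the compared quantity, and all sample values are assumptions for illustration, not measurements from the paper.

```python
def histogram_overlap(a, b, lo, hi, bins=10):
    """Histogram intersection of two 1-D samples on [lo, hi].

    Returns a value in [0, 1]: 1 means identical binned distributions,
    0 means the two samples share no occupied bins.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    if hi <= lo or bins < 1:
        raise ValueError("need hi > lo and bins >= 1")

    def normalized_hist(xs):
        counts = [0] * bins
        width = (hi - lo) / bins
        for x in xs:
            if not lo <= x <= hi:
                raise ValueError(f"value {x} outside [{lo}, {hi}]")
            # Clamp so that x == hi falls in the last bin.
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [c / len(xs) for c in counts]

    return sum(min(p, q) for p, q in zip(normalized_hist(a), normalized_hist(b)))


# Made-up yaw angles (degrees) for a moving run and a multi-fixed set.
moving_yaw = [-30, -20, -10, 0, 10, 20, 30]
multi_fixed_yaw = [-30, 0, 30]
overlap = histogram_overlap(moving_yaw, multi_fixed_yaw, lo=-45, hi=45, bins=9)
```

A low overlap would indicate that the Multi-Fixed baseline does not cover the pose marginals induced by continuous motion, which bears directly on the confound raised in major comment 1.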

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our experimental design and evidence. Where the comments identify gaps, we indicate planned revisions to strengthen the paper.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup and Data Collection Strategies): the claim that gains in the hybrid Moving+Multi-Fixed condition arise specifically from reduced spurious correlations is underdetermined. All conditions share the same dual-arm platform; continuous motion necessarily couples camera trajectories to second-arm joint states, potential self-occlusions, and altered end-effector dynamics. No ablation is reported that holds viewpoint statistics fixed while removing motion (or vice versa), so alternative explanations for the observed generalization differences cannot be ruled out.

    Authors: We acknowledge that the current comparisons do not include an ablation that isolates continuous motion dynamics from viewpoint statistics while holding other factors constant. The Fixed and Multi-Fixed conditions use static viewpoints, while Moving Views and the hybrid introduce motion on the same platform, so confounding effects from joint coupling or occlusions cannot be fully excluded. We will add an explicit discussion of this limitation in the revised §4 and §6, and note that future work could include such controls. The hybrid strategy's empirical superiority across conditions remains as reported, but we agree the causal attribution to shortcut reduction is not definitive without further isolation. revision: partial

  2. Referee: [Abstract and §4] Abstract and §4 (Results): the manuscript states that systematic comparisons were performed and that the hybrid strategy "substantially reduc[es] spurious correlations," yet no quantitative metrics, error bars, statistical tests, or explicit measurements of spurious correlations (e.g., pose-distribution statistics or shortcut probes) are described. Without these, the magnitude, reliability, and cross-model universality of the claimed benefit cannot be verified.

    Authors: The full manuscript reports success rates and generalization gaps for all conditions and models in §5 (with tables comparing Fixed, Multi-Fixed, Moving, and hybrid), but we agree that error bars, statistical tests, and direct measurements of spurious correlations (such as pose distribution overlap or shortcut probes) are not explicitly provided. We will revise the abstract, §4, and §5 to include error bars on all reported metrics, add statistical significance tests between conditions, and include quantitative analysis of viewpoint and pose distributions to support the reduction in spurious correlations. This will allow verification of the magnitude and cross-model consistency. revision: yes
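The error bars and significance tests promised in this response could take the following form for success-rate data. This is a sketch under stated assumptions: it uses a Wilson score interval for each condition and a pooled two-proportion z-test between conditions, which are standard choices but not ones the paper is known to use. The success counts are hypothetical; only the episode count of 400 echoes the Pen-task evaluation size mentioned in the Figure 4 caption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def two_proportion_z(s1, n1, s2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # both rates are identically 0 or 1
    z = (s1 / n1 - s2 / n2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    return z, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts, NOT results from the paper.
hybrid_ci = wilson_interval(300, 400)
baseline_ci = wilson_interval(200, 400)
z_stat, p_value = two_proportion_z(300, 400, 200, 400)
```

Episodes collected in the same session are unlikely to be fully independent, so intervals computed this way are best read as optimistic lower bounds on the true uncertainty.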

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of data-collection strategies

Full rationale

The manuscript contains no equations, fitted parameters, predictions derived from prior fits, or self-citations used as load-bearing premises. All claims rest on direct experimental comparisons (Fixed vs. Multi-Fixed vs. Moving Views) across ACT, Diffusion, and VLA models. No derivation chain exists that could reduce to its own inputs by construction; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is empirical and relies on the domain assumption that shortcut learning explains the generalization failures; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Shortcut learning on spurious correlations, such as fixed relative poses, is the main cause of fragile spatial generalization in VLA models.
    Stated directly in the abstract as the reason increasing viewpoints alone fails.

Reference graph

Works this paper leans on

28 extracted references · 15 canonical work pages · 6 internal anchors

  1. [1]

    Libero-plus: A progressive robustness benchmark for visual-language-action models,

    S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, J. Fu, J. Gong, and X. Qiu, “Libero-plus: A progressive robustness benchmark for visual-language-action models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 38574–38583

  2. [2]

    LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

    X. Zhou, Y. Xu, G. Tie, Y. Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun, “Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization,” arXiv preprint arXiv:2510.03827, 2025

  3. [3]

    Decomposing the generalization gap in imitation learning for visual robotic manipulation,

    A. Xie, L. Lee, T. Xiao, and C. Finn, “Decomposing the generalization gap in imitation learning for visual robotic manipulation,” arXiv preprint arXiv:2307.03659, 2023

  4. [4]

    Radar: Benchmarking vision-language-action generalization via real-world dynamics, spatial-physical intelligence, and autonomous evaluation,

    Y. Chen, Z. Zhan, X. Lin, Z. Song, H. Liu, Q. Lyu, Y. Zu, X. Chen, Z. Liu, T. Pu et al., “Radar: Benchmarking vision-language-action generalization via real-world dynamics, spatial-physical intelligence, and autonomous evaluation,” arXiv preprint arXiv:2602.10980, 2026

  5. [5]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    J. Engel, K. Somasundaram, M. Goesele, A. Sun, A. Gamino, A. Turner, A. Talattof, A. Yuan, B. Souti, B. Meredith et al., “Project aria: A new tool for egocentric multi-modal ai research,” arXiv preprint arXiv:2308.13561, 2023

  6. [6]

    Move: A simple motion-based data collection paradigm for spatial generalization in robotic manipulation,

    H. Wang, C. B. Chen, Y. Yue, D. Tao, T. Guo, S. Xie, D. Huang, S. Song, G. Yao, and G. Huang, “Move: A simple motion-based data collection paradigm for spatial generalization in robotic manipulation,”

  7. [7]
  8. [8]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang et al., “Gr00t n1: An open foundation model for generalist humanoid robots,” arXiv preprint arXiv:2503.14734, 2025

  9. [9]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

  10. [10]

    Spa: 3d spatial-awareness enables effective embodied representation,

    H. Zhu, H. Yang, Y. Wang, J. Yang, L. Wang, and T. He, “Spa: 3d spatial-awareness enables effective embodied representation,” arXiv preprint arXiv:2410.08208, 2024

  11. [11]

    Geoaware-vla: Implicit geometry aware vision-language-action model,

    A. Abouzeid, M. Mansour, Z. Sun, and D. Song, “Geoaware-vla: Implicit geometry aware vision-language-action model,” arXiv preprint arXiv:2509.14117, 2025

  12. [12]

    Og-vla: 3d-aware vision language action model via orthographic image generation,

    I. Singh, A. Goyal, S. Birchfield, D. Fox, A. Garg, and V. Blukis, “Og-vla: 3d-aware vision language action model via orthographic image generation,” arXiv e-prints, pp. arXiv–2506, 2025

  13. [13]

    Rvt: Robotic view transformer for 3d object manipulation,

    A. Goyal, J. Xu, Y. Guo, V. Blukis, Y.-W. Chao, and D. Fox, “Rvt: Robotic view transformer for 3d object manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 694–710

  14. [14]

    Manivid-3d: Generalizable view-invariant reinforcement learning for robotic manipulation via disentangled 3d representations,

    Z. Li, P. Qu, Y. Jia, S. Zhou, H. Ge, J. Cao, J. Zhou, G. Zhou, and J. Ma, “Manivid-3d: Generalizable view-invariant reinforcement learning for robotic manipulation via disentangled 3d representations,” IEEE Robotics and Automation Letters, 2026

  15. [15]

    Vla models are more generalizable than you think: Revisiting physical and spatial modeling,

    W. Li, Q. Zhang, R. Zhai, L. Lin, and G. Wang, “Vla models are more generalizable than you think: Revisiting physical and spatial modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 35025–35035

  16. [16]

    Visual-policy learning through multi-camera view to single-camera view knowledge distillation for robot manipulation tasks,

    C. Acar, K. Binici, A. Tekirdağ, and Y. Wu, “Visual-policy learning through multi-camera view to single-camera view knowledge distillation for robot manipulation tasks,” IEEE Robotics and Automation Letters, vol. 9, no. 1, pp. 691–698, 2023

  17. [17]

    Do you know where your camera is? View-invariant policy learning with camera conditioning,

    T. Jiang, J. Ji, X. Tan, J. Fang, A. Bhattad, V. Guizilini, and M. R. Walter, “Do you know where your camera is? View-invariant policy learning with camera conditioning,” in IEEE International Conference on Robotics and Automation, 2026

  18. [18]

    Agnostic manipulation policies with strategic vantage selection,

    S. Vasudevan, S. Sagar, and R. Senanayake, “Agnostic manipulation policies with strategic vantage selection,” arXiv preprint arXiv:2506.12261, 2025

  19. [19]

    Multi-view masked world models for visual robotic manipulation,

    Y. Seo, J. Kim, S. James, K. Lee, J. Shin, and P. Abbeel, “Multi-view masked world models for visual robotic manipulation,” in International Conference on Machine Learning. PMLR, 2023, pp. 30613–30632

  20. [20]

    CLAR: Learning 3D Representations for Robotic Manipulation by Fusing Masked Reconstruction with Multi-Level Contrastive Alignment

    W. Cui, C. Zhao, Y. Chen, H. Li, Z. Zhang, D. Zhao, and H. Wang, “Cl3r: 3d reconstruction and contrastive learning for enhanced robotic manipulation representations,” arXiv preprint arXiv:2507.08262, 2025

  21. [21]

    View-invariant policy learning via zero-shot novel view synthesis,

    S. Tian, B. Wulfe, K. Sargent, K. Liu, S. Zakharov, V. C. Guizilini, and J. Wu, “View-invariant policy learning via zero-shot novel view synthesis,” in Conference on Robot Learning, 2024

  22. [22]

    Novel demonstration generation with gaussian splatting enables robust one-shot manipulation,

    S. Yang, W. Yu, J. Zeng, J. Lv, K. Ren, C. Lu, D. Lin, and J. Pang, “Novel demonstration generation with gaussian splatting enables robust one-shot manipulation,” in Proceedings of Robotics: Science and Systems, 2025

  23. [23]

    Invariance co-training for robot visual generalization,

    J. Yang, C. Finn, and D. Sadigh, “Invariance co-training for robot visual generalization,” arXiv preprint arXiv:2512.05230, 2025

  24. [24]

    Adversarial data collection: Human-collaborative perturbations for efficient and robust robotic imitation learning,

    S. Huang, Y. Liao, S. Feng, S. Jiang, S. Liu, H. Li, M. Yao, and G. Ren, “Adversarial data collection: Human-collaborative perturbations for efficient and robust robotic imitation learning,” arXiv preprint arXiv:2503.11646, 2025

  25. [25]

    Vision in action: Learning active perception from human demonstrations,

    H. Xiong, X. Xu, J. Wu, Y. Hou, J. Bohg, and S. Song, “Vision in action: Learning active perception from human demonstrations,” in Conference on Robot Learning, 2025

  26. [26]

    Activeumi: Robotic manipulation with active perception from robot-free human demonstrations,

    Q. Zeng, C. Li, J. St. John, Z. Zhou, J. Wen, G. Feng, Y. Zhu, and Y. Xu, “Activeumi: Robotic manipulation with active perception from robot-free human demonstrations,” arXiv preprint arXiv:2510.01607, 2025

  27. [27]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” 2023. [Online]. Available: https://arxiv.org/abs/2304.13705

  28. [28]

    Diffusion policy: Visuomotor policy learning via action diffusion,

    C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” in Proceedings of Robotics: Science and Systems, 2023