pith. machine review for the scientific record.

arxiv: 2604.14125 · v2 · submitted 2026-04-15 · 💻 cs.CV · cs.AI · cs.RO

Recognition: 2 theorem links · Lean Theorem

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords robotic manipulation · vision-language-action · hierarchical architecture · visual grounding · diffusion transformer · long-horizon tasks · fine-grained manipulation · embodied AI

The pith

HiVLA decouples VLM planning from diffusion control to preserve reasoning and improve precise robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that separating high-level task planning and visual grounding in a vision-language model from low-level action generation solves a key limitation in robotic systems. End-to-end training on control data often weakens the model's ability to reason about new tasks or scenes. HiVLA uses the VLM to break down instructions into subtasks with target boxes, then passes those to a specialized diffusion model that executes movements by focusing on relevant visual details. This separation lets each part improve on its own and leads to stronger results on tasks requiring many steps or careful handling of small items amid clutter.
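
To make the decoupling concrete, here is a minimal sketch of that two-stage interface. All names (Plan, planner.decompose, action_expert.sample) are hypothetical placeholders for illustration, not HiVLA's actual API.

    # Minimal sketch of the planner/expert split described above.
    # Every identifier here is a hypothetical placeholder, not the paper's code.
    from dataclasses import dataclass

    @dataclass
    class Plan:
        subtask: str                             # e.g. "pick up the red mug"
        bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2) target box

    def run_subtask(planner, action_expert, obs, instruction):
        # The frozen VLM reasons once per subtask; the expert only executes.
        plan = planner.decompose(obs.image, instruction)
        crop = obs.image.crop(plan.bbox)         # high-resolution object crop
        return action_expert.sample(obs, crop, plan.subtask)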

Core claim

HiVLA is a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. In the low-level part, a flow-matching Diffusion Transformer action expert translates these plans into physical actions, using a cascaded cross-attention mechanism to sequentially fuse global context, high-resolution object-centric crops, and skill semantics.
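
As background on the "flow-matching" half of that claim, the sketch below shows a generic conditional flow-matching loss in its rectified-flow form. The paper's exact parameterization is not given here, so treat this as an assumption-laden illustration of the objective class, with model standing in for the DiT action expert.

    import torch

    def flow_matching_loss(model, actions, cond):
        # Rectified-flow objective: regress the constant velocity that carries
        # Gaussian noise to the ground-truth action chunk along a straight path.
        b = actions.shape[0]
        t = torch.rand(b, 1, 1)                  # per-sample time in [0, 1]
        noise = torch.randn_like(actions)
        x_t = (1.0 - t) * noise + t * actions    # point on the interpolation path
        target_v = actions - noise               # velocity of that path
        pred_v = model(x_t, t.view(b), cond)     # model predicts the velocity field
        return torch.mean((pred_v - target_v) ** 2)

At inference, action chunks are generated by integrating the learned velocity field from noise at t = 0 to t = 1 with a few Euler steps.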

What carries the argument

The visual-grounded-centric hierarchical decoupling between a VLM planner, which outputs subtask instructions and target bounding boxes, and a flow-matching Diffusion Transformer action expert, which executes them via cascaded cross-attention.
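
One way that cascaded fusion could look inside a single DiT block is sketched below: the noisy action latents attend, in sequence, to global context tokens, object-crop tokens, and skill-semantics tokens. The ordering follows the paper's description, but the dimensions, norm placement, and layer choices are assumptions rather than the authors' specification.

    import torch
    import torch.nn as nn

    class CascadedCrossAttentionBlock(nn.Module):
        # Hypothetical DiT block: self-attention over action latents, then
        # three cross-attention stages, one per conditioning stream.
        def __init__(self, dim: int, heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross = nn.ModuleList(
                [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)])
            self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(5)])
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x, global_ctx, crop_tokens, skill_tokens):
            # x: (B, T, D) noisy action latents; each condition is (B, N_i, D).
            h = self.norms[0](x)
            x = x + self.self_attn(h, h, h, need_weights=False)[0]
            for norm, attn, cond in zip(
                    self.norms[1:4], self.cross,
                    (global_ctx, crop_tokens, skill_tokens)):
                x = x + attn(norm(x), cond, cond, need_weights=False)[0]
            return x + self.mlp(self.norms[4](x))

Stacking such blocks and modulating them with the flow-matching timestep would complete an action expert of this shape; the paper's actual block may differ in these details.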

If this is right

  • The VLM's zero-shot reasoning capabilities remain intact because no control data is used to fine-tune it.
  • The planning and execution modules can be improved or swapped independently without retraining the full system.
  • Performance gains appear most clearly in long-horizon skill composition that requires sequencing multiple actions.
  • The system handles fine-grained manipulation of small objects in cluttered scenes more reliably than unified models.
  • Experiments in both simulation and the real world show consistent outperformance over state-of-the-art end-to-end VLA baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Newer vision-language models could be inserted into the high-level slot to gain immediate benefits in planning quality.
  • The design suggests that data efficiency may improve because the action expert trains only on execution data rather than full task reasoning.
  • This separation could extend to other embodied tasks where reasoning must remain flexible while physical execution stays precise.

Load-bearing premise

The VLM planner reliably produces accurate subtask instructions and precise target bounding boxes for new tasks without errors or additional fine-tuning.

What would settle it

Test HiVLA on a long-horizon real-world task with small objects in clutter where the VLM planner outputs an incorrect bounding box or subtask on the first attempt, then measure whether overall success rate falls to or below that of end-to-end baselines.
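
A minimal harness for that experiment might look like the sketch below, which injects a grounding error into the first plan and measures success. The env, planner, and expert objects are hypothetical stand-ins for a real benchmark, not an existing API.

    import random

    def perturb_bbox(bbox, img_w, img_h, shift=0.2):
        # Simulate mis-grounding: translate the planner's box by up to
        # 20% of the image extent, as might happen in heavy clutter.
        x1, y1, x2, y2 = bbox
        dx = random.uniform(-shift, shift) * img_w
        dy = random.uniform(-shift, shift) * img_h
        return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

    def success_under_planner_error(env, planner, expert, episodes=50):
        wins = 0
        for _ in range(episodes):
            obs = env.reset()
            plan = planner.decompose(obs.image, env.instruction)
            plan.bbox = perturb_bbox(plan.bbox, obs.image.width, obs.image.height)
            wins += env.rollout(expert, plan)    # 1 on task success, else 0
        return wins / episodes

Comparing this number against the clean-plan success rate isolates how much the low-level expert can compensate for planner mistakes.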

Figures

Figures reproduced from arXiv: 2604.14125 by Chunpu Xu, Guanyu Chen, Haotian Liang, Jiangmiao Pang, Ping Luo, Tianshuo Yang, Yao Mu, Yitian Liu, Yutian Chen, Zanxin Chen, Zhixuan Liang.

Figure 1
Figure 1. (a) Overview of our proposed HiVLA framework. (b) Success rate comparison on RoboTwin benchmark. view at source ↗
Figure 2
Figure 2. Pipeline of HiVLA. (a) Our decoupled framework utilizes a VLM to decompose user instructions into explicit structured plans, yielding a skill-level subtask and a bounding box used to extract a high-resolution target crop. (b) To execute this plan, the DiT action expert employs a cascaded cross-attention block. This design sequentially conditions the noisy action latents on global visual context, position… view at source ↗
Figure 3
Figure 3. Visualization of RoboTwin tasks and real-world tasks. view at source ↗
Figure 4
Figure 4. Visualization of RoboTwin tasks. view at source ↗
Figure 5
Figure 5. Visualization of real-world tasks. view at source ↗
read the original abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes HiVLA, a hierarchical visual-grounded-centric framework for robotic manipulation that decouples high-level VLM-based semantic planning (task decomposition into subtask instructions and precise target bounding boxes) from low-level control via a flow-matching Diffusion Transformer (DiT) action expert equipped with a cascaded cross-attention mechanism. The architecture is claimed to preserve the base VLM's zero-shot reasoning while enabling independent component improvements, with extensive simulation and real-world experiments showing significant outperformance over state-of-the-art end-to-end VLA baselines, especially in long-horizon skill composition and fine-grained manipulation of small objects in cluttered scenes.

Significance. If the empirical claims are substantiated, the work would be significant for embodied AI and vision-language-action models by resolving the documented trade-off between control fine-tuning and retention of general VLM reasoning. The modular decoupling could support more scalable systems with independent advances in planning and execution, potentially improving generalization in complex manipulation scenarios.

major comments (2)
  1. [Experiments] Experiments section: The central claim of significant outperformance and zero-shot preservation is load-bearing on the VLM planner reliably generating accurate subtask instructions and precise bounding boxes without task-specific fine-tuning or errors. However, no quantitative planner success rates, error breakdowns, or ablations (e.g., performance when the VLM hallucinates or mis-grounds on small/cluttered objects) are reported, leaving open whether gains derive from the architecture or from implicit task curation/low-level compensation.
  2. [Method] Method (high-level planner and low-level DiT): The weakest assumption—that the decoupled design preserves VLM zero-shot capabilities while the cascaded cross-attention enables robust execution—is not supported by evidence such as planner accuracy metrics or comparisons showing what happens under planner failure in the exact regimes where superiority is claimed. Without these, the outperformance cannot be confidently attributed to the hierarchical structure.

minor comments (2)
  1. [Abstract] Abstract: Strong empirical claims are made without any numerical results, specific metrics, or baseline names, which is atypical and reduces immediate verifiability.
  2. [Method] Notation and figures: The cascaded cross-attention mechanism is described as novel but would benefit from a clearer diagram or pseudocode distinguishing it from standard DiT cross-attention, along with explicit input/output specifications for the global context, object-centric crops, and skill semantics fusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and commit to revisions that incorporate the requested quantitative evaluations of the high-level planner to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim of significant outperformance and zero-shot preservation is load-bearing on the VLM planner reliably generating accurate subtask instructions and precise bounding boxes without task-specific fine-tuning or errors. However, no quantitative planner success rates, error breakdowns, or ablations (e.g., performance when the VLM hallucinates or mis-grounds on small/cluttered objects) are reported, leaving open whether gains derive from the architecture or from implicit task curation/low-level compensation.

    Authors: We agree that the manuscript currently lacks explicit quantitative metrics on the VLM planner, such as subtask instruction accuracy, bounding box precision, and error breakdowns. While the reported end-to-end task success rates in simulation and real-world settings, especially for long-horizon and fine-grained tasks, provide indirect support for the planner's effectiveness, we acknowledge this does not fully isolate the planner's contribution. In the revised manuscript, we will add a new analysis section with planner success rates on held-out tasks, categorized error breakdowns (including hallucination and mis-grounding on small/cluttered objects), and ablations that simulate planner errors to demonstrate low-level compensation. revision: yes

  2. Referee: [Method] Method (high-level planner and low-level DiT): The weakest assumption—that the decoupled design preserves VLM zero-shot capabilities while the cascaded cross-attention enables robust execution—is not supported by evidence such as planner accuracy metrics or comparisons showing what happens under planner failure in the exact regimes where superiority is claimed. Without these, the outperformance cannot be confidently attributed to the hierarchical structure.

    Authors: We concur that direct evidence for zero-shot preservation and behavior under planner failures is needed to confidently attribute gains to the hierarchical decoupling. The current work uses the VLM in a zero-shot manner without fine-tuning and shows outperformance over end-to-end fine-tuned baselines, but we agree this is insufficient. We will revise the method and experiments sections to include planner accuracy metrics on standard and custom tasks, as well as controlled comparisons and failure-case simulations in the regimes of long-horizon composition and fine-grained manipulation, highlighting the role of the cascaded cross-attention DiT. revision: yes

Circularity Check

0 steps flagged

No circularity in architectural and empirical claims

full rationale

The paper describes a decoupled hierarchical architecture separating VLM-based high-level planning (task decomposition and visual grounding) from a flow-matching DiT low-level action expert, with claims supported by simulation and real-world experiments showing outperformance on long-horizon tasks. No equations, parameter fitting, or derivations are present that reduce outputs to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The VLM zero-shot assumption is treated as an external capability rather than internally derived, rendering the overall chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The design rests on the assumption that VLMs retain reasoning when decoupled and introduces one new architectural component without external validation.

axioms (1)
  • domain assumption · Fine-tuning end-to-end VLAs on narrow control data compromises inherited VLM reasoning capabilities
    This trade-off is stated as the core motivation for the hierarchical split.
invented entities (1)
  • cascaded cross-attention mechanism · no independent evidence
    purpose: Sequentially fuses global context, high-resolution object-centric crops, and skill semantics in the DiT action expert
    Presented as a novel design element enabling focused execution in the low-level component.

pith-pipeline@v0.9.0 · 5555 in / 1201 out tokens · 58607 ms · 2026-05-12T01:46:22.347057+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 16 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  2. [2]

    Belkhale, S., Ding, T., Xiao, T., Sermanet, P., Vuong, Q., Tompson, J., Chebotar, Y., Dwibedi, D., Sadigh, D.: Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823 (2024)

  3. [3]

    3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks

    Bhat, V., Lan, Y.H., Krishnamurthy, P., Karri, R., Khorrami, F.: 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800 (2025)

  4. [4]

    H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

    Bi, H., Wu, L., Lin, T., Tan, H., Su, Z., Su, H., Zhu, J.: H-rdt: Human manipulation enhanced bimanual robotic manipulation. arXiv preprint arXiv:2507.23523 (2025)

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023)

  8. [8]

    Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation

    Bu, Q., Li, H., Chen, L., Cai, J., Zeng, J., Cui, H., Yao, M., Qiao, Y.: Towards synergistic, generalized, and efficient dual-system for robotic manipulation. arXiv preprint arXiv:2410.08001 (2024)

  9. [9]

    GR-3 Technical Report

    Cheang, C., et al.: Gr-3 technical report. arXiv preprint arXiv:2507.15493 (2025)

  10. [10]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al.: Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025)

  11. [11]

    InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    Chen, X., Chen, Y., Fu, Y., Gao, N., Jia, J., Jin, W., Li, H., Mu, Y., Pang, J., Qiao, Y., Tian, Y., Wang, B., Wang, B., Wang, F., Wang, H., Wang, T., Wang, Z., Wei, X., Wu, C., Yang, S., Ye, J., Yu, J., Zeng, J., Zhang, J., Zhang, J., Zhang, S., Zheng, F., Zhou, B., Zhu, Y.: Internvla-m1: A spatially guided vision-language-action framework for generalis...

  12. [12]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    starVLA Contributors: Starvla: A lego-like codebase for vision-language-action model developing. GitHub repository (2025). https://doi.org/10.5281/zenodo.18264214, https://github.com/starVLA/starVLA

  13. [13]

    Driess, D., Springenberg, J.T., Ichter, B., Yu, L., Li-Bell, A., Pertsch, K., Ren, A.Z., Walke, H., Vuong, Q., Shi, L.X., Levine, S.: Knowledge insulating vision-language-action models: Train fast, run fast, generalize better (2025), https://arxiv.org/abs/2505.23705

  14. [14]

    Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions

    Fan, C., Jia, X., Sun, Y., Wang, Y., Wei, J., Gong, Z., Zhao, X., Tomizuka, M., Yang, X., Yan, J., et al.: Interleave-vla: Enhancing robot manipulation with interleaved image-text instructions. arXiv preprint arXiv:2505.02152 (2025)

  15. [15]

    Hancock, A.J., Wu, X., Zha, L., Russakovsky, O., Majumdar, A.: Actions as language: Fine-tuning vlms into vlas without catastrophic forgetting (2025), https://arxiv.org/abs/2509.22195

  16. [16]

    RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

    Huang, H., Chen, X., Chen, Y., Li, H., Han, X., Wang, Z., Wang, T., Pang, J., Zhao, Z.: Roboground: Robotic manipulation with grounded vision-language priors. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22540–22550 (2025)

  17. [17]

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M.Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Ren, A.Z., Shi, L.X., Smith, L., Springenberg, J.T., Stachow...

  18. [18]

    Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal

    Jiang, Y., Gu, J., Xue, T., Cheung, K.C., Molchanov, P., Yin, H., Liu, S.: Token-efficient vlm: High-resolution image understanding via dynamic region proposal. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 24147–24158 (October 2025)

  19. [19]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)

  20. [20]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., et al.: Openvla: An open-source vision-language-action model. In: Conference on Robot Learning. pp. 2679–2713. PMLR (2025)

  21. [21]

    Lai, X., Li, J., Li, W., Liu, T., Li, T., Zhao, H.: Mini-o3: Scaling up reasoning patterns and interaction turns for visual search (2025), https://arxiv.org/abs/2509.07969

  22. [22]

    Li, Y., Deng, Y., Zhang, J., Jang, J., Memmel, M., Yu, R., Garrett, C.R., Ramos, F., Fox, D., Li, A., Gupta, A., Goyal, A.: Hamster: Hierarchical action models for open-world robot manipulation (2025), https://arxiv.org/abs/2502.05485

  23. [23]

    Liang, Z., Mu, Y., Ma, H., Tomizuka, M., Ding, M., Luo, P.: Skilldiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution (2024), https://arxiv.org/abs/2312.11598

  24. [24]

    Improved Baselines with Visual Instruction Tuning

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

  25. [25]

    Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European conference on computer vision. pp. 38–55. Springer (2024)

  26. [26]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  27. [27]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  28. [28]

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

  29. [29]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

  30. [30]

    Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

    Shao, R., Li, W., Zhang, L., Zhang, R., Liu, Z., Chen, R., Nie, L.: Large vlm-based vision-language-action models for robotic manipulation: A survey. arXiv preprint arXiv:2508.13073 (2025)

  31. [31]

    Scaling Vision Pre-training to 4K Resolution

    Shi, B., Li, B., Cai, H., Lu, Y., Liu, S., Pavone, M., Kautz, J., Han, S., Darrell, T., Molchanov, P., Yin, H.: Scaling vision pre-training to 4k resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9631–9640 (June 2025)

  32. [32]

    Shi, L.X., Ichter, B., Equi, M., Ke, L., Pertsch, K., Vuong, Q., Tanner, J., Walling, A., Wang, H., Fusai, N., Li-Bell, A., Driess, D., Groom, L., Levine, S., Finn, C.: Hi robot: Open-ended instruction following with hierarchical vision-language-action models (2025), https://arxiv.org/abs/2502.19417

  33. [33]

    ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver

    Song, W., Zhou, Z., Zhao, H., Chen, J., Ding, P., Yan, H., Huang, Y., Tang, F., Wang, D., Li, H.: Reconvla: Reconstructive vision-language-action model as effective robot perceiver. arXiv preprint arXiv:2508.10333 (2025)

  34. [34]

    MemER: Scaling Up Memory for Robot Control via Experience Retrieval

    Sridhar, A., Pan, J., Sharma, S., Finn, C.: Memer: Scaling up memory for robot control via experience retrieval. arXiv preprint arXiv:2510.20328 (2025)

  35. [35]

    PaliGemma 2: A Family of Versatile VLMs for Transfer

    Steiner, A., Pinto, A.S., Tschannen, M., Keysers, D., Wang, X., Bitton, Y., Gritsenko, A., Minderer, M., Sherbondy, A., Long, S., et al.: Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555 (2024)

  36. [36]

    Team, G.R., et al.: Gemini robotics: Bringing AI into the physical world (2025), https://arxiv.org/abs/2503.20020

  37. [37]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  38. [38]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)

  39. [39]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Wu, H., Jing, Y., Cheang, C., Chen, G., Xu, J., Li, X., Liu, M., Li, H., Kong, T.: Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139 (2023)

  40. [40]

    CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    Zhao, Q., Lu, Y., Kim, M.J., Fu, Z., Zhang, Z., Wu, Y., Li, Z., Ma, Q., Han, S., Finn, C., et al.: Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1702–1713 (2025)

  41. [41]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning (2025), https://arxiv.org/abs/2505.14362

  42. [42]

    DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

    Zhong, Y., Huang, X., Li, R., Zhang, C., Chen, Z., Guan, T., Zeng, F., Lui, K.N., Ye, Y., Liang, Y., Yang, Y., Chen, Y.: Dexgraspvla: A vision-language-action framework towards general dexterous grasping (2025), https://arxiv.org/abs/2502.20900