Recognition: 2 Lean theorem links
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3
The pith
HiVLA decouples VLM planning from diffusion control to preserve reasoning and improve precise robot manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HiVLA is a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. In the low-level part, a flow-matching Diffusion Transformer (DiT) action expert translates these plans into physical actions, using a cascaded cross-attention mechanism to sequentially fuse global context, high-resolution object-centric crops, and skill semantics.
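The review gives no pseudocode, so the following is a minimal sketch of how the planner-to-expert handoff and a flow-matching training objective could look under this design, assuming a plan that bundles a subtask string with a pixel-space bounding box; the names StructuredPlan, crop_target, and flow_matching_loss and the linear-interpolant formulation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a HiVLA-style planner-to-expert handoff; not the authors' code.
# Assumptions: the VLM planner returns (subtask, bbox); the action expert is a
# flow-matching DiT trained to regress the straight-line velocity from noise to actions.
from dataclasses import dataclass

import torch


@dataclass
class StructuredPlan:                 # hypothetical container for the planner output
    subtask: str                      # e.g. "pick up the red mug"
    bbox: tuple                       # (x1, y1, x2, y2) target box in image pixels


def crop_target(image: torch.Tensor, bbox) -> torch.Tensor:
    """High-resolution object-centric crop taken at the planner's bounding box."""
    x1, y1, x2, y2 = bbox
    return image[..., y1:y2, x1:x2]


def flow_matching_loss(expert, obs, crop, skill_emb, actions):
    """Conditional flow matching: regress the velocity of the linear noise-to-action path."""
    t = torch.rand(actions.shape[0], device=actions.device)     # time in [0, 1]
    t = t.view(-1, *([1] * (actions.dim() - 1)))                # broadcast over action dims
    noise = torch.randn_like(actions)
    x_t = (1.0 - t) * noise + t * actions                       # linear interpolant
    target_velocity = actions - noise
    pred_velocity = expert(x_t, t, obs, crop, skill_emb)        # cascaded-attention DiT
    return torch.mean((pred_velocity - target_velocity) ** 2)
```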
What carries the argument
The visual-grounded-centric hierarchical decoupling between a VLM planner, which outputs subtask instructions and target bounding boxes, and a flow-matching Diffusion Transformer action expert, which executes those plans via cascaded cross-attention.
If this is right
- The VLM's zero-shot reasoning capabilities remain intact because no control data is used to fine-tune it.
- The planning and execution modules can be improved or swapped independently without retraining the full system.
- Performance gains appear most clearly in long-horizon skill composition that requires sequencing multiple actions.
- The system handles fine-grained manipulation of small objects in cluttered scenes more reliably than unified models.
- Experiments in both simulation and the real world show consistent outperformance over state-of-the-art end-to-end VLA baselines.
Where Pith is reading between the lines
- Newer vision-language models could be inserted into the high-level slot to gain immediate benefits in planning quality.
- The design suggests that data efficiency may improve because the action expert trains only on execution data rather than full task reasoning.
- This separation could extend to other embodied tasks where reasoning must remain flexible while physical execution stays precise.
Load-bearing premise
The VLM planner reliably produces accurate subtask instructions and precise target bounding boxes for new tasks without errors or additional fine-tuning.
What would settle it
Test HiVLA on a long-horizon real-world task with small objects in clutter where the VLM planner outputs an incorrect bounding box or subtask on the first attempt, then measure whether overall success rate falls to or below that of end-to-end baselines.
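One hedged way to operationalize this test is sketched below; the perturbation magnitudes, the substitute subtask string, and the helper signatures are assumptions introduced for illustration rather than a protocol taken from the paper.

```python
# Hypothetical planner-failure stress test: inject a grounding or decomposition
# error into the first plan of each episode and compare success rates against
# an end-to-end baseline evaluated on the same episodes.
import random


def perturb_plan(plan, bbox_shift=40, wrong_subtask_rate=0.5):
    """Corrupt either the subtask instruction or the bounding box; both fields
    are assumed to exist on the plan object, mirroring the StructuredPlan sketch."""
    if random.random() < wrong_subtask_rate:
        plan.subtask = "place the object in the bin"   # deliberately wrong subtask
    else:
        x1, y1, x2, y2 = plan.bbox
        plan.bbox = (x1 + bbox_shift, y1 + bbox_shift, x2 + bbox_shift, y2 + bbox_shift)
    return plan


def success_rate(run_episode, policy, episodes):
    """run_episode(policy, episode) -> bool is supplied by the evaluation harness."""
    return sum(run_episode(policy, ep) for ep in episodes) / max(len(episodes), 1)

# The claim would fail if hierarchical success under perturbation drops to or below
# the end-to-end baseline on the same long-horizon, cluttered-scene episodes.
```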
Original abstract
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HiVLA, a hierarchical visual-grounded-centric framework for robotic manipulation that decouples high-level VLM-based semantic planning (task decomposition into subtask instructions and precise target bounding boxes) from low-level control via a flow-matching Diffusion Transformer (DiT) action expert equipped with a cascaded cross-attention mechanism. The architecture is claimed to preserve the base VLM's zero-shot reasoning while enabling independent component improvements, with extensive simulation and real-world experiments showing significant outperformance over state-of-the-art end-to-end VLA baselines, especially in long-horizon skill composition and fine-grained manipulation of small objects in cluttered scenes.
Significance. If the empirical claims are substantiated, the work would be significant for embodied AI and vision-language-action models by resolving the documented trade-off between control fine-tuning and retention of general VLM reasoning. The modular decoupling could support more scalable systems with independent advances in planning and execution, potentially improving generalization in complex manipulation scenarios.
major comments (2)
- [Experiments] Experiments section: The central claim of significant outperformance and zero-shot preservation is load-bearing on the VLM planner reliably generating accurate subtask instructions and precise bounding boxes without task-specific fine-tuning or errors. However, no quantitative planner success rates, error breakdowns, or ablations (e.g., performance when the VLM hallucinates or mis-grounds on small/cluttered objects) are reported, leaving open whether gains derive from the architecture or from implicit task curation/low-level compensation.
- [Method] Method (high-level planner and low-level DiT): The weakest assumption—that the decoupled design preserves VLM zero-shot capabilities while the cascaded cross-attention enables robust execution—is not supported by evidence such as planner accuracy metrics or comparisons showing what happens under planner failure in the exact regimes where superiority is claimed. Without these, the outperformance cannot be confidently attributed to the hierarchical structure.
minor comments (2)
- [Abstract] Abstract: Strong empirical claims are made without any numerical results, specific metrics, or baseline names, which is atypical and reduces immediate verifiability.
- [Method] Notation and figures: The cascaded cross-attention mechanism is described as novel but would benefit from a clearer diagram or pseudocode distinguishing it from standard DiT cross-attention, along with explicit input/output specifications for the global context, object-centric crops, and skill semantics fusion.
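In the spirit of that request, here is one hedged reading of what cascaded cross-attention could mean relative to a standard DiT block: three cross-attention stages applied in sequence over the three conditioning streams instead of a single cross-attention over their concatenation. The module name, stage ordering, and residual structure are assumptions inferred from the abstract, not the authors' released code.

```python
import torch
import torch.nn as nn


class CascadedCrossAttentionBlock(nn.Module):
    """Hypothetical DiT block: self-attention followed by three sequential
    cross-attention stages (global context -> object crop -> skill semantics),
    rather than one cross-attention over concatenated conditioning tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.xattn_global = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.xattn_crop = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.xattn_skill = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(5)])
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, global_ctx, crop_tokens, skill_tokens):
        # x: (B, T, D) noised action tokens; conditioning tensors: (B, N_i, D).
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norms[1](x)
        x = x + self.xattn_global(h, global_ctx, global_ctx, need_weights=False)[0]
        h = self.norms[2](x)
        x = x + self.xattn_crop(h, crop_tokens, crop_tokens, need_weights=False)[0]
        h = self.norms[3](x)
        x = x + self.xattn_skill(h, skill_tokens, skill_tokens, need_weights=False)[0]
        return x + self.mlp(self.norms[4](x))
```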
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and commit to revisions that incorporate the requested quantitative evaluations of the high-level planner to strengthen the manuscript.
Point-by-point responses
Referee: [Experiments] Experiments section: The central claim of significant outperformance and zero-shot preservation is load-bearing on the VLM planner reliably generating accurate subtask instructions and precise bounding boxes without task-specific fine-tuning or errors. However, no quantitative planner success rates, error breakdowns, or ablations (e.g., performance when the VLM hallucinates or mis-grounds on small/cluttered objects) are reported, leaving open whether gains derive from the architecture or from implicit task curation/low-level compensation.
Authors: We agree that the manuscript currently lacks explicit quantitative metrics on the VLM planner, such as subtask instruction accuracy, bounding box precision, and error breakdowns. While the reported end-to-end task success rates in simulation and real-world settings, especially for long-horizon and fine-grained tasks, provide indirect support for the planner's effectiveness, we acknowledge this does not fully isolate the planner's contribution. In the revised manuscript, we will add a new analysis section with planner success rates on held-out tasks, categorized error breakdowns (including hallucination and mis-grounding on small/cluttered objects), and ablations that simulate planner errors to demonstrate low-level compensation. revision: yes
Referee: [Method] Method (high-level planner and low-level DiT): The weakest assumption—that the decoupled design preserves VLM zero-shot capabilities while the cascaded cross-attention enables robust execution—is not supported by evidence such as planner accuracy metrics or comparisons showing what happens under planner failure in the exact regimes where superiority is claimed. Without these, the outperformance cannot be confidently attributed to the hierarchical structure.
Authors: We concur that direct evidence for zero-shot preservation and behavior under planner failures is needed to confidently attribute gains to the hierarchical decoupling. The current work uses the VLM in a zero-shot manner without fine-tuning and shows outperformance over end-to-end fine-tuned baselines, but we agree this is insufficient. We will revise the method and experiments sections to include planner accuracy metrics on standard and custom tasks, as well as controlled comparisons and failure-case simulations in the regimes of long-horizon composition and fine-grained manipulation, highlighting the role of the cascaded cross-attention DiT. revision: yes
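A concrete form the promised planner metrics could take is sketched below, with IoU-based grounding accuracy and exact-match subtask accuracy; the threshold and the function names are assumptions for illustration, not the authors' evaluation protocol.

```python
def bbox_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def planner_metrics(predictions, references, iou_threshold=0.5):
    """predictions / references: lists of (subtask, bbox) pairs on held-out tasks."""
    grounded = sum(bbox_iou(p[1], r[1]) >= iou_threshold
                   for p, r in zip(predictions, references))
    correct_subtask = sum(p[0].strip().lower() == r[0].strip().lower()
                          for p, r in zip(predictions, references))
    n = max(len(references), 1)
    return {"grounding_acc@0.5": grounded / n, "subtask_acc": correct_subtask / n}
```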
Circularity Check
No circularity in architectural and empirical claims
Full rationale
The paper describes a decoupled hierarchical architecture separating VLM-based high-level planning (task decomposition and visual grounding) from a flow-matching DiT low-level action expert, with claims supported by simulation and real-world experiments showing outperformance on long-horizon tasks. No equations, parameter fitting, or derivations are present that reduce outputs to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The VLM zero-shot assumption is treated as an external capability rather than internally derived, rendering the overall chain self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Fine-tuning end-to-end VLAs on narrow control data compromises inherited VLM reasoning capabilities.
invented entities (1)
- cascaded cross-attention mechanism (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel), tagged unclear: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "VLM planner ... generates structured plans, comprising a subtask instruction and a precise target bounding box... cascaded cross-attention mechanism... sequentially fuses global context, high-resolution object-centric crops and skill semantics"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction), tagged unclear: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "decoupled architecture preserves the VLM's zero-shot reasoning"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
- [2]
- [3] Bhat, V., Lan, Y.H., Krishnamurthy, P., Karri, R., Khorrami, F.: 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800 (2025)
- [4] Bi, H., Wu, L., Lin, T., Tan, H., Su, Z., Su, H., Zhu, J.: H-rdt: Human manipulation enhanced bimanual robotic manipulation. arXiv preprint arXiv:2507.23523 (2025)
- [5] Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)
- [6] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)
- [7] Brohan, A., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023)
- [8] Bu, Q., Li, H., Chen, L., Cai, J., Zeng, J., Cui, H., Yao, M., Qiao, Y.: Towards synergistic, generalized, and efficient dual-system for robotic manipulation. arXiv preprint arXiv:2410.08001 (2024)
- [9] Cheang, C., et al.: GR-3 technical report. arXiv preprint arXiv:2507.15493 (2025)
- [10] Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al.: Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025)
- [11] Chen, X., Chen, Y., Fu, Y., Gao, N., Jia, J., Jin, W., Li, H., Mu, Y., Pang, J., Qiao, Y., Tian, Y., Wang, B., Wang, B., Wang, F., Wang, H., Wang, T., Wang, Z., Wei, X., Wu, C., Yang, S., Ye, J., Yu, J., Zeng, J., Zhang, J., Zhang, J., Zhang, S., Zheng, F., Zhou, B., Zhu, Y.: InternVLA-M1: A spatially guided vision-language-action framework for generalist robot policy (2025)
- [12] starVLA Contributors: Starvla: A lego-like codebase for vision-language-action model developing. GitHub repository (2025). https://doi.org/10.5281/zenodo.18264214, https://github.com/starVLA/starVLA
- [13]
- [14] Fan, C., Jia, X., Sun, Y., Wang, Y., Wei, J., Gong, Z., Zhao, X., Tomizuka, M., Yang, X., Yan, J., et al.: Interleave-vla: Enhancing robot manipulation with interleaved image-text instructions. arXiv preprint arXiv:2505.02152 (2025)
- [15]
- [16] Huang, H., Chen, X., Chen, Y., Li, H., Han, X., Wang, Z., Wang, T., Pang, J., Zhao, Z.: Roboground: Robotic manipulation with grounded vision-language priors. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22540–22550 (2025)
- [17] Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M.Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Ren, A.Z., Shi, L.X., Smith, L., Springenberg, J.T., Stachow... arXiv preprint (2025)
- [18] Jiang, Y., Gu, J., Xue, T., Cheung, K.C., Molchanov, P., Yin, H., Liu, S.: Token-efficient vlm: High-resolution image understanding via dynamic region proposal. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 24147–24158 (October 2025)
- [19] Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)
- [20] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., et al.: Openvla: An open-source vision-language-action model. In: Conference on Robot Learning. pp. 2679–2713. PMLR (2025)
- [21]
- [22]
- [23]
- [24] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26296–26306 (2024)
- [25] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European Conference on Computer Vision. pp. 38–55. Springer (2024)
- [26] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [27] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
- [28] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)
- [29] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)
- [30] Shao, R., Li, W., Zhang, L., Zhang, R., Liu, Z., Chen, R., Nie, L.: Large vlm-based vision-language-action models for robotic manipulation: A survey. arXiv preprint arXiv:2508.13073 (2025)
- [31] Shi, B., Li, B., Cai, H., Lu, Y., Liu, S., Pavone, M., Kautz, J., Han, S., Darrell, T., Molchanov, P., Yin, H.: Scaling vision pre-training to 4K resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9631–9640 (June 2025)
- [32] Shi, L.X., Ichter, B., Equi, M., Ke, L., Pertsch, K., Vuong, Q., Tanner, J., Walling, A., Wang, H., Fusai, N., Li-Bell, A., Driess, D., Groom, L., Levine, S., Finn, C.: Hi robot: Open-ended instruction following with hierarchical vision-language-action models (2025), https://arxiv.org/abs/2502.19417
- [33] Song, W., Zhou, Z., Zhao, H., Chen, J., Ding, P., Yan, H., Huang, Y., Tang, F., Wang, D., Li, H.: Reconvla: Reconstructive vision-language-action model as effective robot perceiver. arXiv preprint arXiv:2508.10333 (2025)
- [34] Sridhar, A., Pan, J., Sharma, S., Finn, C.: Memer: Scaling up memory for robot control via experience retrieval. arXiv preprint arXiv:2510.20328 (2025)
- [35] Steiner, A., Pinto, A.S., Tschannen, M., Keysers, D., Wang, X., Bitton, Y., Gritsenko, A., Minderer, M., Sherbondy, A., Long, S., et al.: Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555 (2024)
- [36] Team, G.R., et al.: Gemini robotics: Bringing ai into the physical world (2025), https://arxiv.org/abs/2503.20020
- [37] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
- [38] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)
- [39] Wu, H., Jing, Y., Cheang, C., Chen, G., Xu, J., Li, X., Liu, M., Li, H., Kong, T.: Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139 (2023)
- [40] Zhao, Q., Lu, Y., Kim, M.J., Fu, Z., Zhang, Z., Wu, Y., Li, Z., Ma, Q., Han, S., Finn, C., et al.: Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1702–1713 (2025)
- [41] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning (2025), https://arxiv.org/abs/2505.14362
- [42] Zhong, Y., Huang, X., Li, R., Zhang, C., Chen, Z., Guan, T., Zeng, F., Lui, K.N., Ye, Y., Liang, Y., Yang, Y., Chen, Y.: Dexgraspvla: A vision-language-action framework towards general dexterous grasping (2025), https://arxiv.org/abs/2502.20900
discussion (0)