UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models

Boyong He; Conglin Wang; Jiale Cao; Jianhai Yu; Lige Liu; Lin Sun; Tao Sun; Zhiwei Guan; Zihong Chen; Zongsheng Li

arxiv: 2606.22794 · v1 · pith:DOQA7C65new · submitted 2026-06-22 · 💻 cs.RO


Pith reviewed 2026-06-26 08:55 UTC · model grok-4.3

classification 💻 cs.RO

keywords vision-language-action models · fast-slow architecture · hierarchical layers · inference efficiency · robot manipulation · multi-level supervision · LIBERO benchmark


The pith

A single vision-language model backbone can be stratified into fast-to-slow layers to resolve the frequency dilemma in vision-language-action systems and deliver both higher task success and lower inference latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that mainstream fast-slow dual-system vision-language-action models suffer from a frequency dilemma where large gaps cause semantic drift and small gaps lose efficiency gains, while also discarding intermediate features. It proposes solving this by grouping the layers of one vision-language model into progressively slower update rates, rerouting interactions via latent vector inversion to match fast features with fine actions and slow features with coarse planning, and adding multi-level supervision for a coarse-to-fine hierarchy. A sympathetic reader would care because this design keeps rich intermediate representations and temporal context inside one model instead of splitting systems, which could make robot action models both more accurate and faster to run. If the approach holds, it would mean a single backbone suffices for the full range of dynamics needed in manipulation tasks without extra alignment losses.

Core claim

UniFS introduces a unified fast-to-slow architecture that resolves the frequency dilemma through three designs: stratifying VLM layers into groups with progressively decreasing update frequencies so shallow layers capture fast-changing dynamics and deeper layers cache stable semantics; a latent vector inversion mechanism that re-routes multi-scale VLM features to the action expert to align fast-varying representations with fine-grained decoding and slow-varying ones with coarse planning; and a multi-level supervision strategy that enforces coarse-to-fine learning across temporal scales. This enables richer cross-frequency transfer in one backbone while low-frequency paths preserve context across steps.

What carries the argument

The unified fast-to-slow hierarchical architecture that stratifies VLM layers by decreasing update frequency and uses latent vector inversion to align multi-scale features with action decoding.
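As a rough illustration of the stratification, the sketch below assumes each layer group runs at half the rate of the group before it (the Figure 5 caption states this halving for training). The function names, the three-group split, and the equal group sizes are hypothetical. The speedup estimate counts VLM layers only and ignores the action expert and any caching overhead, so it is not expected to reproduce the paper's reported figures.

```python
from typing import List


def groups_to_update(step: int, num_groups: int) -> List[int]:
    """Return the layer groups recomputed at this control step.

    Assumption: group g (0 = shallowest) has an update period of 2**g steps,
    so each group runs at half the frequency of the previous one. Groups that
    are not listed would reuse their cached output from an earlier step.
    """
    return [g for g in range(num_groups) if step % (2 ** g) == 0]


def layer_count_speedup(layers_per_group: List[int]) -> float:
    """Full-depth layer count divided by the average layers run per step."""
    total = sum(layers_per_group)
    average = sum(n / (2 ** g) for g, n in enumerate(layers_per_group))
    return total / average


# Hypothetical configuration: three groups of eight layers each.
schedule = [groups_to_update(t, num_groups=3) for t in range(4)]
speedup = layer_count_speedup([8, 8, 8])

for t, groups in enumerate(schedule):
    print(f"step {t}: recompute groups {groups}")
print(f"layer-count speedup: {speedup:.2f}x")
```

With this toy split, the shallowest group runs every step while the deepest runs every fourth step, which is the sense in which deeper layers cache stable semantics.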

If this is right

Richer cross-frequency information transfer occurs inside a single backbone instead of across separate models.
Low-frequency pathways preserve temporal context across multiple steps.
State-of-the-art average success rate of 98.3 percent is reached on LIBERO with average inference latency cut from 36.5 ms to 17.8 ms.
Practical applicability holds on a real Franka robot platform.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stratification idea could be tested on non-robotics tasks that require both fast reactions and slow planning, such as video prediction or real-time decision systems.
If the multi-level supervision is the main driver of gains, then similar supervision patterns might improve other hierarchical models even without frequency changes.
Longer-horizon tasks beyond the current benchmarks could reveal whether the preserved temporal context actually compounds over extended sequences.

Load-bearing premise

Progressively decreasing update frequencies across VLM layers will automatically align shallow fast-changing dynamics with fine-grained action decoding and deeper slow semantics with coarse planning without requiring additional cross-layer alignment losses.
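The pairing this premise relies on can be written down as a simple index reversal. The sketch below is only one possible reading of "latent vector inversion": it assumes the action expert has as many stages as the VLM has frequency groups, and that its stages run from coarse planning to fine-grained decoding. Neither assumption is confirmed by the abstract, and `inverted_routing` is a hypothetical name.

```python
from typing import Dict


def inverted_routing(num_levels: int) -> Dict[int, int]:
    """Map each action-expert stage to the VLM frequency group it reads from.

    Assumptions (not taken from the paper):
      * stage 0 is the coarsest stage, stage num_levels - 1 the finest;
      * group 0 is the shallowest, fastest-updating VLM group.
    Reversing the order pairs coarse stages with slow, deep features and
    fine stages with fast, shallow features.
    """
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    return {stage: num_levels - 1 - stage for stage in range(num_levels)}


routing = inverted_routing(3)
for stage, group in routing.items():
    print(f"action stage {stage} reads VLM group {group}")
```

Whether such a reordering actually enforces the intended alignment, rather than merely changing access order, is exactly what the premise leaves open.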

What would settle it

Running the same LIBERO tasks with all layers forced to the same update frequency or with the latent vector inversion removed, and checking whether the 2.5 percent success gain and 2.1 times speedup both disappear, would directly test whether the frequency stratification and inversion are necessary.
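The headline numbers in this test can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on this page. Reading the 2.5 percent gain as absolute percentage points is an assumption, so the implied baseline success rate is an inference rather than a reported value.

```python
# Figures quoted from the abstract.
baseline_latency_ms = 36.5
unifs_latency_ms = 17.8
unifs_success_pct = 98.3
reported_gain = 2.5

speedup = baseline_latency_ms / unifs_latency_ms
latency_saved_ms = baseline_latency_ms - unifs_latency_ms

# Assumption: the gain is measured in absolute percentage points.
implied_baseline_success_pct = unifs_success_pct - reported_gain

print(f"speedup: {speedup:.2f}x")
print(f"latency saved per step: {latency_saved_ms:.1f} ms")
print(f"implied baseline success: {implied_baseline_success_pct:.1f}%")
```

The ratio comes out slightly above 2.05, which rounds to the 2.1× quoted in the abstract.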

Figures

Figures reproduced from arXiv: 2606.22794 by Boyong He, Conglin Wang, Jiale Cao, Jianhai Yu, Lige Liu, Lin Sun, Tao Sun, Zhiwei Guan, Zihong Chen, Zongsheng Li.

**Figure 1.** Motivation and Architecture Overview. (a) The separate-module paradigm transmits only slow latent vectors from the VLM to the action expert. (b) The embedded paradigm integrates the action expert directly within the VLM layers. (c) Mainstream methods suffer from rigid hard-linking between separate modules. Our approach unifies these dynamics via multi-frequency latent vectors for seamless coordination.

**Figure 2.** Cosine distance between layer-wise latent representations over time. (a) Layer-wise feature evolution in the π0 model over time steps. (b) Corresponding temporal feature dynamics in the VLA-Adapter. The curves represent different network layers, illustrating how feature stability varies across the depth of the models.

**Figure 3.** Overview of our proposed UniFS. In (a), we divide the VLM into multiple frequency groups, with each group operating at a fixed frequency. In (b), we illustrate how a batch performs temporal substitution under VLA parallel training. In (c), we intuitively illustrate the execution frequencies of different layers in one inference loop, showing a theoretical speedup of about 2.6×.

**Figure 4.** Visualization of tasks from the four LIBERO benchmark suites and three real-world Franka scenarios. … similar initial configurations; and (4) LIBERO-10 (g, h), also called LIBERO-Long, is a challenging sequential multi-task suite requiring agents to master 10 distinct manipulation skills without catastrophic forgetting. This simulation environment is built on robosuite [55] with a Franka Panda robot model, …

**Figure 5.** Ablation study on sampling frequency configurations. Ablation study on efficiency. Benefiting from the temporal batch sampling strategy with random temporal sampling, the frequency for each group can also be customized during inference, although it is fixed to half that of the preceding group during training. We characterize each configuration by the maximum-to-minimum sampling frequency ratio acros…

**Figure 6.** Layer-wise temporal feature dynamics of UniFS.

**Figure 7.** Timing analysis. We visualize the inference time of UniFS across continuous timesteps in the LIBERO-Spatial suite.

**Figure 8.** Qualitative visualizations of our proposed UniFS on both LIBERO and real-world experiments.
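The layer-wise analysis named in the captions of Figures 2 and 6 rests on cosine distance between a layer's features at different time steps. The sketch below shows one way to compute such a curve in plain Python. Measuring every step against the first step is an assumption (the captions do not say which reference is used), and the two-dimensional feature vectors are toy values, not model activations.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def temporal_drift(features: List[Sequence[float]]) -> List[float]:
    """Distance of each time step's feature from the first time step's.

    Assumption: the first step is the reference; a consecutive-step
    comparison would be an equally plausible choice.
    """
    reference = features[0]
    return [cosine_distance(reference, f) for f in features]


# Toy features for two hypothetical layers over two time steps.
fast_layer = [[1.0, 0.0], [0.0, 1.0]]   # direction changes completely
slow_layer = [[1.0, 0.0], [2.0, 0.0]]   # direction unchanged, only scale

fast_drift = temporal_drift(fast_layer)
slow_drift = temporal_drift(slow_layer)
print("fast layer drift:", fast_drift)
print("slow layer drift:", slow_drift)
```

A layer whose curve stays near zero is a candidate for a low update frequency, since its cached output remains close to what a fresh forward pass would give.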

Original abstract

Mainstream Fast-Slow dual-system vision-language-action models decouple a high-frequency action expert from a low-frequency vision-language model for efficiency, yet they face a fundamental frequency dilemma: large update gaps cause semantic drift from stale context, while small gaps erode the intended computational savings. Moreover, because the action expert receives only the VLM's final-layer representation at a single fixed frequency, rich intermediate features are discarded, limiting both information coupling and manipulation precision. Inspired by multi-timescale neural processing in the human brain, we introduce UniFS, a unified fast-to-slow architecture that resolves these challenges through three key designs. First, we stratify the VLM layers into groups with progressively decreasing update frequencies, enabling shallow layers to capture fast-changing dynamics while deeper layers cache stable semantic context. Second, a latent vector inversion mechanism re-routes the interaction order between multi-scale VLM features and the action expert, aligning fast-varying representations with fine-grained action decoding and slow-varying ones with coarse planning. Third, a multi-level supervision strategy enforces a coarse-to-fine learning hierarchy across temporal scales. Together, these designs enable richer cross-frequency information transfer within a single backbone, while the low-frequency pathways additionally preserve temporal context across steps. Experiments on LIBERO show that UniFS achieves state-of-the-art performance (98.3% average success rate, a 2.5% gain over the VLA-Adapter baseline) while reducing average inference latency from 36.5 ms to 17.8 ms (2.1× speedup). Real-robot experiments on a Franka platform further validate its practical applicability. Code is open-sourced at https://github.com/linsun449/UniFS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UniFS claims a single-backbone fix for the fast-slow frequency dilemma in VLA models via layer stratification, latent inversion, and multi-level supervision, with reported 2.1x speedup on LIBERO, but the abstract leaves the alignment mechanism and experimental controls unverified.


The main thing to know is that this paper replaces the usual separate fast action expert and slow VLM with one backbone where VLM layers are grouped by decreasing update frequency, a latent vector inversion re-routes multi-scale features to the action expert, and multi-level supervision enforces coarse-to-fine learning. Those three pieces are presented as a unified departure from prior decoupled systems.

It does a reasonable job stating the frequency dilemma and semantic drift problem, then showing how the designs aim to keep rich intermediate features while cutting compute. The LIBERO numbers are specific (98.3% success, 2.5% over VLA-Adapter, latency down to 17.8 ms) and they include real Franka arm tests plus open code, which helps.

The soft spots are exactly the ones the stress-test note flags. The abstract supplies no ablations or intermediate checks showing that the inversion actually produces the intended fast-dynamics-to-fine-action and slow-semantics-to-coarse-planning mapping rather than just changing access order. Without that evidence, the gains could trace to the supervision schedule or parameter reuse instead. Experimental details on baselines, seeds, and environment variation are also missing, so the 2.1x claim is hard to assess for now.

This is for people working on deployable VLA models who need lower latency without losing accuracy. A reader focused on practical robot control stacks would get value from the architecture description and the open implementation.

It has enough of a concrete claim and open artifacts to deserve a serious referee rather than a desk reject, even though the full paper will need to address the alignment evidence and controls.

Referee Report

2 major / 2 minor

Summary. The manuscript presents UniFS, a unified fast-to-slow hierarchical architecture for vision-language-action (VLA) models. It addresses the frequency dilemma in fast-slow dual-system VLA models by stratifying VLM layers into groups with progressively decreasing update frequencies, introducing a latent vector inversion mechanism to re-route multi-scale features, and applying multi-level supervision for coarse-to-fine learning. Experiments on the LIBERO benchmark report state-of-the-art performance with 98.3% average success rate (2.5% improvement over VLA-Adapter) and a 2.1× reduction in inference latency from 36.5 ms to 17.8 ms, with additional validation on a real Franka robot platform. The code is open-sourced.

Significance. If the performance and efficiency gains prove robust, UniFS offers a promising direction for unifying fast and slow processing within a single VLM backbone, enabling richer cross-frequency information transfer while preserving temporal context. The open-sourced code at the provided GitHub link is a clear strength that supports reproducibility.

major comments (2)

[Experiments on LIBERO] Experiments section: The reported 98.3% average success rate and 2.1× speedup (from 36.5 ms to 17.8 ms) are presented without information on experimental controls, statistical significance, baseline implementation details, or robustness across random seeds and environment variations. This makes it impossible to determine whether the gains are attributable to the three proposed designs rather than other factors.
[Latent vector inversion mechanism] Method section on latent vector inversion: The description states that inversion re-routes interaction to align fast-varying representations with fine-grained action decoding and slow-varying ones with coarse planning, yet no derivation, intermediate feature analysis, or ablation demonstrates that the chosen frequencies plus inversion enforce this mapping (rather than merely reordering access) in the absence of additional cross-layer alignment losses beyond the stated multi-level supervision.

minor comments (2)

[Abstract] The abstract refers to 'rich intermediate features' being discarded in prior work but does not quantify or reference which specific layers or representations are involved.
[Method] Notation for the progressively decreasing update frequencies across layer groups would benefit from an explicit equation or table defining the frequency schedule per group.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. Below we respond point-by-point to the major comments, indicating where revisions will be made to improve clarity and rigor.


Referee: [Experiments on LIBERO] Experiments section: The reported 98.3% average success rate and 2.1× speedup (from 36.5 ms to 17.8 ms) are presented without information on experimental controls, statistical significance, baseline implementation details, or robustness across random seeds and environment variations. This makes it impossible to determine whether the gains are attributable to the three proposed designs rather than other factors.

Authors: We agree that the current experimental reporting lacks sufficient controls and statistical details. In the revised manuscript we will add: (i) results across 5 random seeds with mean and standard deviation, (ii) explicit baseline re-implementation details including training hyperparameters and hardware, (iii) additional environment variation tests, and (iv) statistical significance measures (e.g., paired t-tests or confidence intervals). These additions will allow readers to better attribute performance gains to the proposed designs.

revision: yes

Referee: [Latent vector inversion mechanism] Method section on latent vector inversion: The description states that inversion re-routes interaction to align fast-varying representations with fine-grained action decoding and slow-varying ones with coarse planning, yet no derivation, intermediate feature analysis, or ablation demonstrates that the chosen frequencies plus inversion enforce this mapping (rather than merely reordering access) in the absence of additional cross-layer alignment losses beyond the stated multi-level supervision.

Authors: The inversion is motivated by matching temporal scales to action granularity, with effectiveness shown via end-to-end performance and existing architecture ablations. We acknowledge the value of more direct evidence. In revision we will add an ablation isolating the inversion component and intermediate feature visualizations (e.g., cosine similarity or activation maps before/after inversion) to demonstrate the enforced alignment beyond simple reordering.

revision: yes
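The seed-level reporting promised in the first response (mean, standard deviation, and a paired test across 5 seeds) could look roughly like the sketch below. It uses only the Python standard library. The per-seed numbers are made-up toy values chosen for easy arithmetic and are not results from the paper, and the helper names are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(method: Sequence[float], baseline: Sequence[float]) -> float:
    """t statistic for paired samples, with len(method) - 1 degrees of freedom."""
    if len(method) != len(baseline):
        raise ValueError("paired samples must have the same length")
    diffs = [m - b for m, b in zip(method, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0.0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Toy success rates (percent) for 5 seeds; NOT measurements from the paper.
toy_method = [90.0, 92.0, 91.0, 93.0, 94.0]
toy_baseline = [88.0, 89.0, 90.0, 90.0, 93.0]

method_mean, method_std = mean_and_std(toy_method)
t_stat = paired_t_statistic(toy_method, toy_baseline)
print(f"method: {method_mean:.1f} +/- {method_std:.2f}")
print(f"paired t statistic: {t_stat:.3f} (df = {len(toy_method) - 1})")
```

Turning the statistic into a p-value or confidence interval needs the t distribution, which the standard library does not provide. A library routine such as `scipy.stats.ttest_rel` would be the usual choice in practice.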

Circularity Check

0 steps flagged

No circularity: empirical results from architectural proposal


The paper introduces three architectural modifications (layer stratification by update frequency, latent vector inversion, multi-level supervision) to address frequency dilemmas in VLA models and reports empirical gains on LIBERO (98.3% success, 2.1× speedup). No equations, parameter fits, or self-citations appear in the abstract or described claims that would reduce the performance numbers to inputs by construction. The outcome is presented as an experimental consequence of the designs rather than a derived quantity equivalent to the inputs; the central claim remains an independent empirical observation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The designs are described at the level of architectural choices rather than new physical or mathematical postulates.



Reference graph

Works this paper leans on

56 extracted references · 39 canonical work pages · 19 internal anchors

[1]

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
[2]

Behrouz, A., Razaviyayn, M., Zhong, P., Mirrokni, V.: Nested learning: The illusion of deep learning architectures. arXiv preprint arXiv:2512.24695 (2025)
[3]

Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)
[4]

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)
[5]

Bu, Q., Li, H., Chen, L., Cai, J., Zeng, J., Cui, H., Yao, M., Qiao, Y.: Towards synergistic, generalized, and efficient dual-system for robotic manipulation. arXiv preprint arXiv:2410.08001 (2024)
[6]

Cen, J., Yu, C., Yuan, H., Jiang, Y., Huang, S., Guo, J., Li, X., Song, Y., Luo, H., Wang, F., et al.: WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539 (2025)
[7]

Chen, H., Liu, J., Gu, C., Liu, Z., Zhang, R., Li, X., He, X., Guo, Y., Fu, C.W., Zhang, S., et al.: Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953 (2025)
[8]

Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In: European Conference on Computer Vision. pp. 19–35. Springer (2024)
[9]

Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)
[10]

Chignoli, M., Kim, D., Stanger-Jones, E., Kim, S.: The MIT humanoid robot: Design, motion planning, and control for acrobatic behaviors. In: 2020 IEEE-RAS 20th International Conference on Humanoid Robots (Humanoids). pp. 1–8. IEEE (2021)
[11]

Cui, C., Ding, P., Song, W., Bai, S., Tong, X., Ge, Z., Suo, R., Zhou, W., Liu, Y., Jia, B., et al.: Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation. arXiv preprint arXiv:2505.03912 (2025)
[12]

Deng, S., Yan, M., Wei, S., Ma, H., Yang, Y., Chen, J., Zhang, Z., Yang, T., Zhang, X., Zhang, W., et al.: GraspVLA: a grasping foundation model pre-trained on billion-scale synthetic action data. arXiv preprint arXiv:2505.03233 (2025)
[13]

Du, Z., Liu, B., Liang, Y., Shen, Y., Cao, H., Zheng, X., Feng, Z., Wu, Z., Yang, J., Jiang, Y.G.: Himoe-vla: Hierarchical mixture-of-experts for generalist vision-language-action policies. arXiv preprint arXiv:2512.05693 (2025)
[14]

Han, B., Kim, J., Jang, J.: A dual process vla: Efficient robotic manipulation leveraging vlm. arXiv preprint arXiv:2410.15549 (2024)
[15]

Hu, Y., Xie, Q., Jain, V., Francis, J., Patrikar, J., Keetha, N., Kim, S., Xie, Y., Zhang, T., Fang, H.S., et al.: Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782 (2023)
[16]

Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)
[17]

Jerath, R., Beveridge, C., Jensen, M.: On the hierarchical organization of oscillatory assemblies: layered superimposition and a global bioelectric framework. Frontiers in human neuroscience 13, 426 (2019)
[18]

Jiang, T., Jiang, X., Ma, Y., Wen, X., Li, B., Zhan, K., Jia, P., Liu, Y., Sun, S., Lang, X.: The better you learn, the smarter you prune: Towards efficient vision-language-action models via differentiable token pruning. arXiv preprint arXiv:2509.12594 (2025)
[19]

Kahneman, D.: Thinking, fast and slow. Macmillan (2011)
[20]

Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)
[21]

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)
[22]

Li, W., Zhang, R., Shao, R., He, J., Nie, L.: CogVLA: Cognition-aligned vision-language-action model via instruction-driven routing & sparsification. arXiv preprint arXiv:2508.21046 (2025)
[23]

Li, Y., Meng, Y., Sun, Z., Ji, K., Tang, C., Fan, J., Ma, X., Xia, S., Wang, Z., Zhu, W.: Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723 (2025)
[24]

Li, Y., Deng, Y., Zhang, J., Jang, J., Memmel, M., Yu, R., Garrett, C.R., Ramos, F., Fox, D., Li, A., et al.: Hamster: Hierarchical action models for open-world robot manipulation. arXiv preprint arXiv:2502.05485 (2025)
[25]

Li, Z., Hu, B., Shao, R., Chen, G., Jiang, D., Xie, P., Hao, J., Nie, L.: Global prior meets local consistency: Dual-memory augmented vision-language-action model for efficient robotic manipulation. arXiv preprint arXiv:2602.20200 (2026)
[26]

Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, 44776–44791 (2023)
[27]

Liu, J., Shi, X., Nguyen, T.D., Zhang, H., Zhang, T., Sun, W., Li, Y., Vasilakos, A.V., Iacca, G., Khan, A.A., et al.: Neural brain: a neuroscience-inspired framework for embodied agents. arXiv preprint arXiv:2505.07634 (2025)
[28]

Ma, Y., Song, Z., Zhuang, Y., Hao, J., King, I.: A survey on vision-language-action models for embodied ai. arXiv preprint arXiv:2405.14093 (2024)
[29]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
[30]

Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)
[31]

Qu, D., Song, H., Chen, Q., Chen, Z., Gao, X., Ye, X., Lv, Q., Shi, M., Ren, G., Ruan, C., et al.: Eo-1: Interleaved vision-text-action pretraining for general robot control. arXiv preprint arXiv:2508.21112 (2025)
[32]

Qu, D., Song, H., Chen, Q., Yao, Y., Ye, X., Ding, Y., Wang, Z., Gu, J., Zhao, B., Wang, D., et al.: SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830 (2025)
[33]

Sapkota, R., Cao, Y., Roumeliotis, K.I., Karkee, M.: Vision-language-action (vla) models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769 (2025)
[34]

Senkowski, D., Engel, A.K.: Multi-timescale neural dynamics for multisensory integration. Nature Reviews Neuroscience 25(9), 625–642 (2024)
[35]

Shi, H., Xie, B., Liu, Y., Sun, L., Liu, F., Wang, T., Zhou, E., Fan, H., Zhang, X., Huang, G.: MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236 (2025)
[36]

Shine, J.M.: Neuromodulatory influences on integration and segregation in the brain. Trends in cognitive sciences 23(7), 572–583 (2019)
[37]

Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al.: SmolVLA: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844 (2025)
[38]

Snyder, A.C.: Resonant hierarchies: a multiscale framework for oscillatory dynamics in the brain. Frontiers in Psychology 17, 1704370 (2026)
[39]

Song, H., Qu, D., Yao, Y., Chen, Q., Lv, Q., Tang, Y., Shi, M., Ren, G., Yao, M., Zhao, B., et al.: Hume: Introducing system-2 thinking in visual-language-action model. arXiv preprint arXiv:2505.21432 (2025)
[40]

Song, W., Chen, J., Ding, P., Zhao, H., Zhao, W., Zhong, Z., Ge, Z., Li, Z., Wang, D., Wang, L., et al.: Pd-vla: Accelerating vision-language-action model integrated with action chunking via parallel decoding. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 13162–13169. IEEE (2025)
[41]

Tan, X., Yang, Y., Ye, P., Zheng, J., Bai, B., Wang, X., Hao, J., Chen, T.: Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models. arXiv preprint arXiv:2505.21200 (2025)
[42]

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
[43]

Wang, Y., Ding, P., Li, L., Cui, C., Ge, Z., Tong, X., Song, W., Zhao, H., Zhao, W., Hou, P., et al.: VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372 (2025)
[44]

Wen, J., Zhu, M., Zhu, Y., Tang, Z., Li, J., Zhou, Z., Li, C., Liu, X., Peng, Y., Shen, C., et al.: Diffusion-vla: Generalizable and interpretable robot foundation model via self-generated reasoning. arXiv preprint arXiv:2412.03293 (2024)
[45]

IEEE Robotics and Automation Letters (2025) 2 18 L Sun et al

Wen, J., Zhu, Y., Li, J., Zhu, M., Tang, Z., Wu, K., Xu, Z., Liu, N., Cheng, R., Shen, C., et al.: Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters (2025) 2 18 L Sun et al

[46]

Xu, S., Wang, Y., Xia, C., Zhu, D., Huang, T., Xu, C.: Vla-cache: Towards efficient vision-language-action model via adaptive token caching in robotic manipulation. arXiv e-prints pp. arXiv–2502 (2025)

[47]

Yu, Z., Wang, B., Zeng, P., Zhang, H., Zhang, J., Gao, L., Song, J., Sebe, N., Shen, H.T.: A survey on efficient vision-language-action models. arXiv preprint arXiv:2510.24795 (2025)

[48]

Yue, Y., Wang, Y., Kang, B., Han, Y., Wang, S., Song, S., Feng, J., Huang, G.: Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution. Advances in Neural Information Processing Systems 37, 56619–56643 (2024)

[49]

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)

[50]

Zhang, D., Sun, J., Hu, C., Wu, X., Yuan, Z., Zhou, R., Shen, F., Zhou, Q.: Pure vision language action (vla) models: A comprehensive survey. arXiv preprint arXiv:2509.19012 (2025)

[51]

Zhang, J., Guo, Y., Chen, X., Wang, Y.J., Hu, Y., Shi, C., Chen, J.: Hirt: Enhancing robotic control with hierarchical robot transformers. arXiv preprint arXiv:2410.05273 (2024)

[52]

Zhao, Q., Lu, Y., Kim, M.J., Fu, Z., Zhang, Z., Wu, Y., Li, Z., Ma, Q., Han, S., Finn, C., et al.: Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1702–1713 (2025)

[53]

Zheng, J., Li, J., Liu, D., Zheng, Y., Wang, Z., Ou, Z., Liu, Y., Liu, J., Zhang, Y.Q., Zhan, X.: Universal actions for enhanced embodied foundation models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22508–22519 (2025)

[54]

Zheng, J., Li, J., Wang, Z., Liu, D., Kang, X., Feng, Y., Zheng, Y., Zou, J., Chen, Y., Zeng, J., et al.: X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274 (2025)

[55]

Zhu, Y., Wong, J., Mandlekar, A., Martín-Martín, R., Joshi, A., Lin, K., Maddukuri, A., Nasiriany, S., Zhu, Y.: robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293 (2020)

[56]

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)

A Detailed Architecture

As shown in Table 4, in our implementation, we find that the vision e...

