Fast Enough to Act: Spatio-Temporal Visual Token Merging for Low-Latency Robotic VLMs and VLAs

Gang Zhou; Jindong Wang; Junzhou Chen

arxiv: 2606.29350 · v1 · pith:L6TZCUV5new · submitted 2026-06-28 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-30 07:22 UTC · model grok-4.3

keywords visual token merging · spatio-temporal fusion · vision-language models · vision-language action models · low-latency inference · robotic control · training-free method · high-resolution video

The pith

ST-Merge fuses redundant visual tokens across 3D space and time to deliver large inference speedups in robotic VLMs and VLAs while preserving task success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ST-Merge, a plug-and-play method that merges similar visual tokens from high-resolution images and video directly in the encoding stage to reduce the token count passed to vision-language models for robots. It constructs explicit 3D spatiotemporal coordinates for tokens across frames, applies multi-queue parallel matching and weighted aggregation to combine redundancies, and adds a post-merge correction that adjusts positional encodings using the weighted centroid. The aim is to make inference fast enough for real-time robot control without retraining the underlying model. Experiments demonstrate 2 times speedup on Qwen2.5-VL for video question answering with 1 percent precision loss, and 8.3 times speedup on the π0.5 VLA policy at 1024 by 1024 resolution with no drop in success rate. A sympathetic reader would care because token volume currently blocks practical deployment of capable vision-language policies on physical robots.
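To make the merge step concrete, here is a minimal sketch of similarity-based merging with weighted aggregation. It is not the authors' multi-queue matcher, whose details the abstract does not give; it substitutes a simple bipartite scheme in the spirit of Token Merging [15]. The `(t, y, x)` coordinate layout, the `threshold` and `radius` parameters, and every function name are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_tokens(feats, coords, sizes, threshold=0.9, radius=float("inf")):
    """Merge even-indexed tokens into similar odd-indexed ones.

    feats  : list of feature vectors
    coords : list of (t, y, x) positions (assumed coordinate layout)
    sizes  : how many original tokens each entry already represents
    Returns a list of (feature, coord, size) tuples.
    """
    src = range(0, len(feats), 2)        # candidates to be merged away
    dst = list(range(1, len(feats), 2))  # tokens that absorb others
    # Running size-weighted sums for each destination token.
    acc = {
        j: ([f * sizes[j] for f in feats[j]],
            [c * sizes[j] for c in coords[j]],
            sizes[j])
        for j in dst
    }
    kept = []
    for i in src:
        best_j, best_sim = None, threshold
        for j in dst:
            if math.dist(coords[i], coords[j]) > radius:
                continue  # too far apart in space-time to count as redundant
            sim = cosine(feats[i], feats[j])
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is None:
            kept.append((list(feats[i]), list(coords[i]), sizes[i]))
            continue
        f_sum, c_sum, w = acc[best_j]
        acc[best_j] = (
            [a + sizes[i] * b for a, b in zip(f_sum, feats[i])],
            [a + sizes[i] * b for a, b in zip(c_sum, coords[i])],
            w + sizes[i],
        )
    # Divide the sums by the total size: weighted mean feature and centroid.
    merged = [([f / w for f in f_sum], [c / w for c in c_sum], w)
              for f_sum, c_sum, w in acc.values()]
    return kept + merged
```

Each merged token carries a size-weighted mean feature and a size-weighted centroid coordinate. The centroid is generally fractional, which is why a separate positional correction step is needed afterwards.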

Core claim

ST-Merge is a training-free framework that fuses redundant tokens during visual encoding by explicitly constructing 3D spatiotemporal coordinates, employing multi-queue parallel matching and weighted aggregation for geometrically consistent fusion across frames, and introducing a post-merge positional correction that dynamically re-evaluates the rotational position code of the weighted centroid to eliminate spatial deviation and maintain the high-precision spatial awareness required for dexterous robotic operation.

What carries the argument

Multi-queue parallel matching on 3D spatiotemporal coordinates combined with weighted-centroid positional correction after merging
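The positional correction relies on a property of rotary position embeddings [40]: the rotation angle is a continuous function of position, so a fractional centroid can be encoded as directly as an integer grid index. The sketch below shows this for a single coordinate axis. It assumes the standard adjacent-pair rotation with base 10000; how ST-Merge splits channels across its three axes is not stated in the abstract, and the function names are illustrative.

```python
import math

def weighted_centroid(positions, weights):
    """Weighted mean of scalar positions along one axis."""
    return sum(p * w for p, w in zip(positions, weights)) / sum(weights)

def apply_rope(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`.

    `pos` may be fractional, e.g. the centroid of several merged tokens.
    """
    dim = len(vec)
    out = []
    for k in range(dim // 2):
        theta = pos * base ** (-2.0 * k / dim)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * k], vec[2 * k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

# Two tokens at x = 3 and x = 4, representing 1 and 3 original tokens.
center = weighted_centroid([3, 4], [1, 3])  # 3.75
rotated = apply_rope([1.0, 2.0, 3.0, 4.0], center)
```

Because the rotations compose, the dot product between two rotated vectors depends only on the difference of their positions. That holds for fractional positions too, so relative offsets between merged tokens stay well defined.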

If this is right

On Qwen2.5-VL for video QA, the method produces a 2 times inference speedup with only a 1 percent loss in precision.
On the π0.5 VLA policy at 1024 by 1024 resolution, it produces an 8.3 times speedup while matching the baseline task success rate.
The framework operates without any additional training and can be inserted into existing visual encoders.
At lower resolutions the method introduces a small drop in accuracy.
High-resolution inputs become practical for real-time robotic policies that previously required downsampling.
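A back-of-envelope calculation shows why resolution matters for token count and what a speedup means for control rate. The patch size of 16 and the 400 ms baseline latency below are hypothetical placeholders, not figures from the paper; only the 8.3× factor is taken from the abstract.

```python
def patch_tokens(height, width, patch=16):
    """Number of patch tokens for one frame (patch size is an assumption)."""
    return (height // patch) * (width // patch)

def control_rate_hz(baseline_latency_s, speedup):
    """Inferences per second after applying a latency speedup."""
    return speedup / baseline_latency_s

low = patch_tokens(224, 224)     # 14 * 14 patches
high = patch_tokens(1024, 1024)  # 64 * 64 patches
rate = control_rate_hz(0.4, 8.3) # hypothetical 400 ms baseline
```

Under these assumptions a 1024×1024 frame yields about 21 times as many tokens as a 224×224 frame. The control-rate figure only applies if the reported speedup is end to end, which the abstract does not specify.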

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the spatial correction works reliably, the same merging logic could support longer video sequences or multi-camera inputs without proportional compute growth.
The technique might combine with other efficiency methods such as quantization to produce further latency reductions in deployed systems.
Similar coordinate-based merging could be tested on non-robotic real-time vision tasks that also suffer from token overload.

Load-bearing premise

That fusing tokens via similarity in 3D spatiotemporal coordinates and correcting positions with the weighted centroid will not remove or distort the spatial relationships needed for accurate robotic actions.

What would settle it

A controlled experiment showing a clear drop in success rate on a fine-grained manipulation task when ST-Merge is applied at 1024 by 1024 resolution compared with the unmerged baseline at identical resolution.
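One way to decide whether such a drop is "clear" is a one-sided Fisher exact test on episode counts. The sketch below is a generic statistical check, not a procedure from the paper, and the counts in the usage lines are invented for illustration.

```python
from math import comb

def fisher_one_sided(succ_base, n_base, succ_merge, n_merge):
    """P-value for 'the merged policy succeeds less often than the baseline'.

    Sums hypergeometric probabilities of seeing `succ_merge` or fewer
    successes in the merged arm, given the pooled success total.
    """
    total_succ = succ_base + succ_merge
    denom = comb(n_base + n_merge, total_succ)
    lowest = max(0, total_succ - n_base)  # fewest merged successes possible
    p = 0.0
    for k in range(lowest, succ_merge + 1):
        p += comb(n_merge, k) * comb(n_base, total_succ - k) / denom
    return p

# Hypothetical counts: 50 episodes per arm.
p_same = fisher_one_sided(45, 50, 45, 50)  # identical success counts
p_drop = fisher_one_sided(48, 50, 30, 50)  # large drop in the merged arm
```

A small p-value for the merged arm at identical resolution would count against the premise. A large one is only weak support unless the episode count is high enough to detect small drops.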

Figures

Figures reproduced from arXiv: 2606.29350 by Gang Zhou, Jindong Wang, Junzhou Chen.

**Figure 2.** Pipeline of ST-Merge, which can be seamlessly deployed within existing LVLMs. Inserted into shallow layer of …

**Figure 3.** Real-World setup. The image shows the SO-101 …

**Figure 4.** Qualitative results of real-world robotic manipulation on the SO-ARM101 [44] platform, using the …

Abstract

Vision-language models and vision-language action models endow the robot with unprecedented capabilities. However, the input of video and high-resolution images yields a massive number of visual tokens, leading to extremely high inference latency and severely hindering the robot's real-time control. To break through this computational bottleneck, we propose ST-Merge, a plug-and-play, training-free framework that efficiently fuses redundant tokens directly during the visual encoding phase. By explicitly constructing 3D spatiotemporal coordinates, it employs a multi-queue parallel matching and weighted aggregation mechanism to achieve efficient and geometrically consistent fusion of redundant tokens across frames. In addition, we introduce a post-merge positional correction mechanism that effectively eliminates spatial deviation caused by merging by dynamically re-evaluating the rotational position code of the weighted centroid of the vision token, thereby ensuring the high-precision spatial awareness required for dexterous operation. In the Video Question Answering task on the mainstream VLM, Qwen2.5-VL, ST-Merge achieves a 2× inference speedup with only a tiny 1% loss in precision. When deployed on the π0.5 VLA policy, ST-Merge achieves an 8.3× speedup at 1024×1024 resolution and matches the baseline success rate at this high-resolution setting. At lower resolutions, it introduces a small drop in accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ST-Merge is a training-free token-merging method for high-resolution images and video in robotic VLMs and VLAs, with a reported 8.3× speedup at matched success rate on π0.5 at 1024×1024, but the abstract reports none of the ablations needed to trust the spatial-correction claim.

ST-Merge is a plug-and-play way to cut visual tokens during encoding by building 3D spatiotemporal coordinates, running multi-queue parallel matching, doing weighted aggregation, and then fixing positions with a post-merge centroid correction. The headline result is an 8.3× speedup on the π0.5 VLA at 1024×1024 resolution while matching baseline success rate, plus a 2× speedup on Qwen2.5-VL video QA with 1% precision loss.

The work is straightforward engineering that targets the exact latency problem in real-time robotic control. Being training-free and scoped to the visual encoding stage makes it easy to drop in. The 3D coordinate construction and the explicit positional correction step are the parts that go beyond generic token merging.

The numbers are useful for anyone running high-resolution inputs on robots. The fact that they tested on an actual VLA policy rather than just a VLM is a plus.

The soft spot is the missing experimental detail. The abstract states the speed and accuracy figures but shows no ablations, no variance across runs, and no analysis of how much the post-merge correction actually matters. The stress-test point about whether centroid repositioning preserves fine spatial relations for dexterous actions is reasonable; without position-error metrics or a controlled test that turns the correction off, the matched success rate could be task-specific rather than general. At lower resolutions the method already shows a small accuracy drop, which hints at limits.

This is for people who need to ship real-time vision-language policies on hardware where token count is the blocker. A reader working on robotic deployment would find the reported gains worth checking, provided the full paper supplies the missing controls.

Send it to review. The practical target and the concrete numbers on a real policy are enough to justify referee time even if the validation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper proposes ST-Merge, a training-free, plug-and-play token-merging framework for VLMs and VLAs. It builds explicit 3D spatiotemporal coordinates, performs multi-queue parallel matching with weighted aggregation to fuse redundant visual tokens during encoding, and applies a post-merge positional correction that recomputes rotational position encodings from the weighted centroid of merged tokens. On Qwen2.5-VL video QA it reports 2× speedup with 1% precision loss; on the π0.5 VLA policy at 1024×1024 resolution it reports 8.3× speedup while matching baseline success rate (with a small accuracy drop at lower resolutions).

Significance. If the empirical claims and the spatial-preservation argument hold under rigorous verification, the method would directly address the latency barrier that currently prevents high-resolution visual input from being used in real-time robotic VLAs, offering a drop-in acceleration technique without retraining.

major comments (2)

[Abstract / §4] Abstract and §4 (experimental claims): the headline results (8.3× speedup and matched success rate on π0.5 at 1024×1024; 2× speedup with 1% loss on Qwen2.5-VL) are stated without any description of baselines, number of trials, variance, error bars, or ablation studies. This absence prevents verification that the observed parity is attributable to the post-merge correction rather than task-specific tolerance.
[Method (post-merge correction)] Method description (post-merge positional correction): the central claim that re-evaluating the rotational position code from the weighted centroid “eliminates spatial deviation” and preserves “high-precision spatial awareness required for dexterous operation” is load-bearing for the robotic application, yet the manuscript provides neither per-token position-error statistics nor an ablation that disables only the correction step. Without these, it is impossible to assess whether sub-pixel shifts introduced by averaging affect fine-grained control policies.
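One candidate for the per-token position-error statistic requested above is the RMS displacement between each original token and the position assigned to the merged token that absorbs it. The sketch compares a weighted-centroid assignment with a "keep the first member's position" assignment, which stands in for a run with the correction disabled. Both the metric and the comparison are assumptions for illustration, not the paper's protocol.

```python
import math

def centroid(points, weights):
    """Weighted centroid of (t, y, x) points."""
    total = sum(weights)
    return tuple(
        sum(w * p[axis] for p, w in zip(points, weights)) / total
        for axis in range(len(points[0]))
    )

def rms_displacement(groups, assigned):
    """RMS distance from every original token to its merged token's position.

    groups   : list of merge groups, each a list of original (t, y, x) coords
    assigned : position given to each merged token, in the same order
    """
    squared, count = 0.0, 0
    for members, pos in zip(groups, assigned):
        for m in members:
            squared += math.dist(m, pos) ** 2
            count += 1
    return math.sqrt(squared / count)

groups = [[(0, 0, 0), (0, 0, 2)]]
uncorrected = rms_displacement(groups, [groups[0][0]])  # keep first position
corrected = rms_displacement(groups, [centroid(groups[0], [1, 1])])
```

The centroid minimizes the summed squared distance to group members, so `corrected` can never exceed `uncorrected` on this metric. Whether that geometric gain translates into policy performance is exactly what the requested ablation would have to show.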

minor comments (2)

[Method] Notation for the 3D coordinate construction and the multi-queue matching algorithm should be formalized with explicit equations rather than prose descriptions to allow reproducibility.
[Abstract] The abstract states “matches the baseline success rate at this high-resolution setting” but does not specify the exact success-rate values or the number of evaluation episodes; these numbers should appear in the main text or a table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to provide the requested verification.

Referee: [Abstract / §4] Abstract and §4 (experimental claims): the headline results (8.3× speedup and matched success rate on π0.5 at 1024×1024; 2× speedup with 1% loss on Qwen2.5-VL) are stated without any description of baselines, number of trials, variance, error bars, or ablation studies. This absence prevents verification that the observed parity is attributable to the post-merge correction rather than task-specific tolerance.

Authors: We agree that the current experimental reporting lacks these details. In the revised manuscript we will expand §4 to specify the baselines, number of trials, variance measures with error bars, and include an ablation isolating the post-merge correction to confirm its contribution to the observed parity. revision: yes
Referee: [Method (post-merge correction)] Method description (post-merge positional correction): the central claim that re-evaluating the rotational position code from the weighted centroid “eliminates spatial deviation” and preserves “high-precision spatial awareness required for dexterous operation” is load-bearing for the robotic application, yet the manuscript provides neither per-token position-error statistics nor an ablation that disables only the correction step. Without these, it is impossible to assess whether sub-pixel shifts introduced by averaging affect fine-grained control policies.

Authors: We acknowledge that the manuscript currently lacks per-token position-error statistics and an ablation disabling only the correction. The revised version will add these: position-error metrics before/after correction and an ablation removing solely the post-merge step, to quantify any effect of averaging on spatial precision and policy performance. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no self-referential derivations

The paper describes ST-Merge as a training-free algorithmic framework using explicit 3D spatiotemporal coordinate construction, multi-queue matching, weighted aggregation, and post-merge centroid correction. No equations, fitted parameters, or derivations appear in the abstract that reduce predictions to inputs by construction. Reported speedups and success rates are framed as direct empirical measurements on Qwen2.5-VL and π0.5, not as quantities defined from self-citations or ansatzes. No load-bearing self-citation chains or uniqueness theorems are invoked. The approach is self-contained as an engineering proposal evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or new invented entities; the method is described at the level of algorithmic components without derivations or postulated objects.

Reference graph

Works this paper leans on

44 extracted references · 18 canonical work pages · 11 internal anchors

[1] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[3] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “DeepSeek-V3 technical report,” arXiv preprint arXiv:2412.19437, 2024.
[4] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-VL technical report,” arXiv preprint arXiv:2502.13923, 2025.
[5] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
[6] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “LLaVA-OneVision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024.
[7] F. Batool, Y. Yaqoot, M. Zafar, R. A. Khan, M. H. Khan, A. Fedoseev, and D. Tsetserukou, “Impedancegpt: Vlm-driven impedance control of swarm of mini-drones for intelligent navigation in dynamic environment,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 2592–2597.
[8] Y. Zhu, X. Wang, F. Yu, T. Lei, and Y. Sun, “Rod-vlm: A framework of real-time robotic perception, reasoning and manipulation,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 866–873.
[9] X. Wu, R. Xian, T. Guan, J. Liang, S. Chakraborty, F. Liu, B. M. Sadler, D. Manocha, and A. Bedi, “On the safety concerns of deploying llms/vlms in robotics: Highlighting the risks and vulnerabilities,” in First Vision and Language for Autonomous Driving and Robotics Workshop, 2024.
[10] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi et al., “OpenVLA: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024.
[11] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai et al., “π0.5: a vision-language-action model with open-world generalization,” arXiv preprint arXiv:2504.16054, 2025.
[12] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025.
[13] S.-H. Lee, J. Wang, Z. Zhang, D. Fan, and X. Li, “Video token merging for long-form video understanding,” arXiv preprint arXiv:2410.23782, 2024.
[14] Q. Cao, B. Paranjape, and H. Hajishirzi, “Pumer: Pruning and merging tokens for efficient vision language models,” arXiv preprint arXiv:2305.17530, 2023.
[15] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your ViT but faster,” arXiv preprint arXiv:2210.09461, 2022.
[16] Z. Lin, M. Lin, L. Lin, and R. Ji, “Boosting multimodal large language models with visual tokens withdrawal for rapid inference,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 5334–5342.
[17] L. Chen, H. Zhao, T. Liu, S. Bai, J. Lin, C. Zhou, and B. Chang, “An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models,” in European Conference on Computer Vision. Springer, 2024, pp. 19–35.
[18] T. Fu, T. Liu, Q. Han, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang, “Framefusion: Combining similarity and importance for video token reduction on large vision language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22654–22663.
[19] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 44776–44791, 2023.
[20] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” 2023.
[21] J. Chen and S. Lu, “An advanced driving agent with the multimodal large language model for autonomous vehicles,” in 2024 IEEE International Conference on Mobility, Operations, Services and Technologies (MOST). IEEE, 2024, pp. 1–11.
[22] M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in Conference on Robot Learning. PMLR, 2022, pp. 894–906.
[23] Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, “Vima: General robot manipulation with multimodal prompts,” in Fortieth International Conference on Machine Learning, 2023.
[24] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “PaLM-E: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023.
[25] C. H. Song, V. Blukis, J. Tremblay, S. Tyree, Y. Su, and S. Birchfield, “Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15768–15780.
[26] W. Zhou, M. Tao, C. Zhao, H. Guo, H. Dong, M. Tang, and J. Wang, “Physvlm: Enabling visual language models to understand robotic physical reachability,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6940–6949.
[27] W. Zhao, J. Chen, Z. Meng, D. Mao, R. Song, and W. Zhang, “Vlmpc: Vision-language model predictive control for robotic manipulation,” arXiv preprint arXiv:2407.09829, 2024.
[28] M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti et al., “SmolVLA: A vision-language-action model for affordable and efficient robotics,” arXiv preprint arXiv:2506.01844, 2025.
[29] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European Conference on Computer Vision. Springer, 2024, pp. 256–274.
[30] Y. Shao, D. Lin, F. Zeng, M. Yan, M. Zhang, S. Chen, Y. Fan, Z. Yan, H. Wang, J. Guo et al., “Tr-dq: Time-rotation diffusion quantization,” arXiv preprint arXiv:2503.06564, 2025.
[31] C. Renggli, A. S. Pinto, N. Houlsby, B. Mustafa, J. Puigcerver, and C. Riquelme, “Learning to merge tokens in vision transformers,” arXiv preprint arXiv:2202.12015, 2022.
[32] S. Kim, S. Shen, D. Thorsley, A. Gholami, W. Kwon, J. Hassoun, and K. Keutzer, “Learned token pruning for transformers,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 784–794.
[33] Z. Zhan, Z. Kong, Y. Gong, Y. Wu, Z. Meng, H. Zheng, X. Shen, S. Ioannidis, W. Niu, P. Zhao et al., “Exploring token pruning in vision state space models,” Advances in Neural Information Processing Systems, vol. 37, pp. 50952–50971, 2024.
[34] C. Yang, Y. Sui, J. Xiao, L. Huang, Y. Gong, C. Li, J. Yan, Y. Bai, P. Sadayappan, X. Hu et al., “Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19803–19813.
[35] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16344–16359, 2022.
[36] Y. Shang, M. Cai, B. Xu, Y. J. Lee, and Y. Yan, “Llava-prumerge: Adaptive token reduction for efficient large multimodal models,” arXiv preprint arXiv:2403.15388, 2024.
[37] B. Heo, S. Park, D. Han, and S. Yun, “Rotary position embedding for vision transformer,” in European Conference on Computer Vision. Springer, 2024, pp. 289–305.
[38] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” Advances in Neural Information Processing Systems, vol. 36, pp. 28541–28564, 2023.
[39] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24185–24198.
[40] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024.
[41] L. Shen, T. Hao, T. He, S. Zhao, Y. Zhang, P. Liu, Y. Bao, and G. Ding, “Tempme: Video temporal token merging for efficient text-video retrieval,” arXiv preprint arXiv:2409.01156, 2024.
[42] LeRobot, “lerobot_π0.5_base,” https://huggingface.co/lerobot/pi05_base, 2026, Hugging Face model repository. Accessed: 2026-02-27.
[43] B. Krojer, M. Komeili, C. Ross, Q. Garrido, K. Sinha, N. Ballas, and M. Assran, “A shortcut-aware video-qa benchmark for physical understanding via minimal video pairs,” arXiv, 2025.
[44] R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, P. Kooijmans, M. Aractingi, M. Shukor, D. Aubakirova, M. Russi, F. Capuano, C. Pascal, J. Choghari, J. Moss, and T. Wolf, “Lerobot: State-of-the-art machine learning for real-world robotics in pytorch,” https://github.com/huggingface/lerobot, 2024.

[1] [1]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Ka- dian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

DeepSeek-V3 Technical Report

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Qwen2.5-VL Technical Report

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2. 5-vl technical report,” arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Qwen3 Technical Report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

LLaVA-OneVision: Easy Visual Task Transfer

B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Impedancegpt: Vlm-driven impedance control of swarm of mini-drones for intelligent navigation in dynamic environment,

F. Batool, Y. Yaqoot, M. Zafar, R. A. Khan, M. H. Khan, A. Fedoseev, and D. Tsetserukou, “Impedancegpt: Vlm-driven impedance control of swarm of mini-drones for intelligent navigation in dynamic environment,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 2592–2597

2025

[8] [8]

Rod-vlm: A framework of real-time robotic perception, reasoning and manipulation,

Y. Zhu, X. Wang, F. Yu, T. Lei, and Y. Sun, “Rod-vlm: A framework of real-time robotic perception, reasoning and manipulation,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 866–873

2025

[9] [9]

On the safety concerns of deploying llms/vlms in robotics: Highlighting the risks and vulnerabilities,

X. Wu, R. Xian, T. Guan, J. Liang, S. Chakraborty, F. Liu, B. M. Sadler, D. Manocha, and A. Bedi, “On the safety concerns of deploying llms/vlms in robotics: Highlighting the risks and vulnerabilities,” in First Vision and Language for Autonomous Driving and Robotics Workshop, 2024

2024

[10] [10]

OpenVLA: An Open-Source Vision-Language-Action Model

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Bal- akrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. San- keti et al., “Openvla: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai et al., “ π0.5: a vision-language-action model with open-world generaliza- tion,” arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

2025

[13]

Video token merging for long-form video understanding

S.-H. Lee, J. Wang, Z. Zhang, D. Fan, and X. Li, “Video token merging for long-form video understanding,” arXiv preprint arXiv:2410.23782, 2024

2024

[14]

Pumer: Pruning and merging tokens for efficient vision language models

Q. Cao, B. Paranjape, and H. Hajishirzi, “Pumer: Pruning and merging tokens for efficient vision language models,” arXiv preprint arXiv:2305.17530, 2023

2023

[15]

Token Merging: Your ViT But Faster

D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your vit but faster,” arXiv preprint arXiv:2210.09461, 2022

2022

[16]

Boosting multimodal large language models with visual tokens withdrawal for rapid inference

Z. Lin, M. Lin, L. Lin, and R. Ji, “Boosting multimodal large language models with visual tokens withdrawal for rapid inference,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 5334–5342

2025

[17]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

L. Chen, H. Zhao, T. Liu, S. Bai, J. Lin, C. Zhou, and B. Chang, “An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models,” in European Conference on Computer Vision. Springer, 2024, pp. 19–35

2024

[18]

Framefusion: Combining similarity and importance for video token reduction on large vision language models

T. Fu, T. Liu, Q. Han, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang, “Framefusion: Combining similarity and importance for video token reduction on large vision language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22 654–22 663

2025

[19]

Libero: Benchmarking knowledge transfer for lifelong robot learning

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 44 776–44 791, 2023

2023

[20]

Visual instruction tuning

H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” 2023

2023

[21]

An advanced driving agent with the multimodal large language model for autonomous vehicles

J. Chen and S. Lu, “An advanced driving agent with the multimodal large language model for autonomous vehicles,” in 2024 IEEE International Conference on Mobility, Operations, Services and Technologies (MOST). IEEE, 2024, pp. 1–11

2024

[22]

Cliport: What and where pathways for robotic manipulation

M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in Conference on robot learning. PMLR, 2022, pp. 894–906

2022

[23]

Vima: General robot manipulation with multimodal prompts

Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, “Vima: General robot manipulation with multimodal prompts,” in Fortieth International Conference on Machine Learning, 2023

2023

[24]

PaLM-E: An Embodied Multimodal Language Model

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023

2023

[25]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

C. H. Song, V. Blukis, J. Tremblay, S. Tyree, Y. Su, and S. Birchfield, “Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15 768–15 780

2025

[26]

Physvlm: Enabling visual language models to understand robotic physical reachability

W. Zhou, M. Tao, C. Zhao, H. Guo, H. Dong, M. Tang, and J. Wang, “Physvlm: Enabling visual language models to understand robotic physical reachability,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6940–6949

2025

[27]

Vlmpc: Vision-language model predictive control for robotic manipulation

W. Zhao, J. Chen, Z. Meng, D. Mao, R. Song, and W. Zhang, “Vlmpc: Vision-language model predictive control for robotic manipulation,” arXiv preprint arXiv:2407.09829, 2024

2024

[28]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti et al., “Smolvla: A vision-language-action model for affordable and efficient robotics,” arXiv preprint arXiv:2506.01844, 2025

2025

[29]

Drivelm: Driving with graph visual question answering

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European conference on computer vision. Springer, 2024, pp. 256–274

2024

[30]

Tr-dq: Time-rotation diffusion quantization

Y. Shao, D. Lin, F. Zeng, M. Yan, M. Zhang, S. Chen, Y. Fan, Z. Yan, H. Wang, J. Guo et al., “Tr-dq: Time-rotation diffusion quantization,” arXiv preprint arXiv:2503.06564, 2025

2025

[31]

Learning to merge tokens in vision transformers

C. Renggli, A. S. Pinto, N. Houlsby, B. Mustafa, J. Puigcerver, and C. Riquelme, “Learning to merge tokens in vision transformers,” arXiv preprint arXiv:2202.12015, 2022

2022

[32]

Learned token pruning for transformers

S. Kim, S. Shen, D. Thorsley, A. Gholami, W. Kwon, J. Hassoun, and K. Keutzer, “Learned token pruning for transformers,” in Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, 2022, pp. 784–794

2022

[33]

Exploring token pruning in vision state space models

Z. Zhan, Z. Kong, Y. Gong, Y. Wu, Z. Meng, H. Zheng, X. Shen, S. Ioannidis, W. Niu, P. Zhao et al., “Exploring token pruning in vision state space models,” Advances in Neural Information Processing Systems, vol. 37, pp. 50 952–50 971, 2024

2024

[34]

Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model

C. Yang, Y. Sui, J. Xiao, L. Huang, Y. Gong, C. Li, J. Yan, Y. Bai, P. Sadayappan, X. Hu et al., “Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19 803–19 813

2025

[35]

Flashattention: Fast and memory-efficient exact attention with io-awareness

T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in neural information processing systems, vol. 35, pp. 16 344–16 359, 2022

2022

[36]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models

Y. Shang, M. Cai, B. Xu, Y. J. Lee, and Y. Yan, “Llava-prumerge: Adaptive token reduction for efficient large multimodal models,” arXiv preprint arXiv:2403.15388, 2024

2024

[37]

Rotary position embedding for vision transformer

B. Heo, S. Park, D. Han, and S. Yun, “Rotary position embedding for vision transformer,” in European Conference on Computer Vision. Springer, 2024, pp. 289–305

2024

[38]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day

C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” Advances in Neural Information Processing Systems, vol. 36, pp. 28 541–28 564, 2023

2023

[39]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 24 185–24 198

2024

[40]

Roformer: Enhanced transformer with rotary position embedding

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024

2024

[41]

Tempme: Video temporal token merging for efficient text-video retrieval

L. Shen, T. Hao, T. He, S. Zhao, Y. Zhang, P. Liu, Y. Bao, and G. Ding, “Tempme: Video temporal token merging for efficient text-video retrieval,” arXiv preprint arXiv:2409.01156, 2024

2024

[42]

lerobot_π0.5_base

LeRobot, “lerobot_π0.5_base,” https://huggingface.co/lerobot/pi05_base, 2026, Hugging Face model repository. Accessed: 2026-02-27

2026

[43]

A shortcut-aware video-qa benchmark for physical understanding via minimal video pairs

B. Krojer, M. Komeili, C. Ross, Q. Garrido, K. Sinha, N. Ballas, and M. Assran, “A shortcut-aware video-qa benchmark for physical understanding via minimal video pairs,” arXiv, 2025

2025

[44]

Lerobot: State-of-the-art machine learning for real-world robotics in pytorch

R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, P. Kooijmans, M. Aractingi, M. Shukor, D. Aubakirova, M. Russi, F. Capuano, C. Pascal, J. Choghari, J. Moss, and T. Wolf, “Lerobot: State-of-the-art machine learning for real-world robotics in pytorch,” https://github.com/huggingface/lerobot, 2024

2024