arxiv: 2606.29350 · v1 · pith:L6TZCUV5new · submitted 2026-06-28 · 💻 cs.CV · cs.AI

Fast Enough to Act: Spatio-Temporal Visual Token Merging for Low-Latency Robotic VLMs and VLAs

Pith reviewed 2026-06-30 07:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual token merging · spatio-temporal fusion · vision-language models · vision-language action models · low-latency inference · robotic control · training-free method · high-resolution video

The pith

ST-Merge fuses redundant visual tokens across 3D space and time to deliver large inference speedups in robotic VLMs and VLAs while preserving task success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ST-Merge, a plug-and-play method that merges similar visual tokens from high-resolution images and video during the visual encoding stage, reducing the token count passed to vision-language models for robots. It constructs explicit 3D spatiotemporal coordinates for tokens across frames, applies multi-queue parallel matching and weighted aggregation to combine redundant tokens, and adds a post-merge correction that recomputes positional encodings from the weighted centroid of the merged tokens. The aim is to make inference fast enough for real-time robot control without retraining the underlying model. The reported experiments show a 2× speedup on Qwen2.5-VL for video question answering with a 1 percent precision loss, and an 8.3× speedup on the π0.5 VLA policy at 1024 by 1024 resolution with no drop in success rate. A sympathetic reader would care because token volume currently blocks practical deployment of capable vision-language policies on physical robots.
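To make the merging idea concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the greedy, sequential matching below is a stand-in for the multi-queue parallel matching, and the function names, the cosine-similarity criterion, and the `threshold` value are all assumptions made for illustration. It only shows how tokens can carry explicit (t, y, x) coordinates and how a count-weighted merge yields both an averaged feature and a weighted-centroid position.

```python
import math

def make_coords(frames, height, width):
    """Give every token an explicit (t, y, x) coordinate, in frame-major order."""
    return [(float(t), float(y), float(x))
            for t in range(frames) for y in range(height) for x in range(width)]

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(features, coords, threshold=0.9):
    """Greedy stand-in for the matching step (hypothetical, not the paper's method).

    Each token joins the first existing cluster whose feature is similar enough;
    otherwise it starts a new cluster. Features and positions are averaged with
    weights equal to the number of tokens already merged into the cluster.
    """
    clusters = []  # each cluster: {"feat": [...], "pos": (t, y, x), "weight": n}
    for feat, pos in zip(features, coords):
        for c in clusters:
            if cosine(c["feat"], feat) >= threshold:
                w = c["weight"]
                c["feat"] = [(w * f + g) / (w + 1) for f, g in zip(c["feat"], feat)]
                c["pos"] = tuple((w * p + q) / (w + 1) for p, q in zip(c["pos"], pos))
                c["weight"] = w + 1
                break
        else:
            clusters.append({"feat": list(feat), "pos": pos, "weight": 1})
    return clusters

# Two frames of a 1x2 token grid whose content does not change over time.
coords = make_coords(frames=2, height=1, width=2)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
merged = merge_tokens(features, coords)
# Four tokens collapse into two clusters, each centred at t = 0.5.
```

In this toy input the static tokens merge across time but not across space, so the merged positions keep their original x coordinate and move to the temporal midpoint. A real implementation would operate on batched tensors and would need a matching scheme that runs in parallel.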

Core claim

ST-Merge is a training-free framework that fuses redundant tokens during visual encoding by explicitly constructing 3D spatiotemporal coordinates, employing multi-queue parallel matching and weighted aggregation for geometrically consistent fusion across frames, and introducing a post-merge positional correction that dynamically re-evaluates the rotational position code of the weighted centroid to eliminate spatial deviation and maintain the high-precision spatial awareness required for dexterous robotic operation.

What carries the argument

Multi-queue parallel matching on 3D spatiotemporal coordinates combined with weighted-centroid positional correction after merging
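The positional-correction step can be illustrated with a small sketch. It assumes that the "rotational position code" refers to rotary position embeddings (RoPE, as in references [37] and [40]) and that the correction amounts to evaluating the rotary angles at the fractional weighted-centroid coordinate instead of at one of the original integer positions. The function names, the single-axis treatment, and the consecutive-pair rotation layout are illustrative assumptions; the paper's exact formulation is not given in the abstract.

```python
import math

def weighted_centroid(positions, weights):
    """Weighted mean of (t, y, x) positions; weights are per-token merge weights."""
    total = sum(weights)
    dims = len(positions[0])
    return tuple(sum(w * p[d] for p, w in zip(positions, weights)) / total
                 for d in range(dims))

def rope_angles(position, dim, base=10000.0):
    """Standard RoPE angles for one axis; `position` may be fractional."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by the angle for `position`."""
    out = []
    for i, angle in enumerate(rope_angles(position, len(vec), base)):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out += [a * c - b * s, a * s + b * c]
    return out

# A cluster of three tokens at the origin and one token at (t=1, y=0, x=2).
centroid = weighted_centroid([(0, 0, 0), (1, 0, 2)], [3, 1])  # (0.25, 0.0, 0.5)

# Re-evaluate the rotary code of the merged token at its centroid x coordinate
# rather than at either original integer position.
merged_feature = [1.0, 2.0, 3.0, 4.0]
corrected = apply_rope(merged_feature, centroid[2])
```

Because rotary angles are a continuous function of position, nothing in the formula requires integer coordinates, which is what makes a centroid-based correction possible in principle. Whether the correction is sufficient for fine manipulation is the empirical question the referee report raises.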

If this is right

  • On Qwen2.5-VL for video QA, the method produces a 2 times inference speedup with only a 1 percent loss in precision.
  • On the π0.5 VLA policy at 1024 by 1024 resolution, it produces an 8.3 times speedup while matching the baseline task success rate.
  • The framework operates without any additional training and can be inserted into existing visual encoders.
  • At lower resolutions on π0.5, the method introduces a small drop in accuracy.
  • High-resolution inputs become practical for real-time robotic policies that previously required downsampling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the spatial correction works reliably, the same merging logic could support longer video sequences or multi-camera inputs without proportional compute growth.
  • The technique might combine with other efficiency methods such as quantization to produce further latency reductions in deployed systems.
  • Similar coordinate-based merging could be tested on non-robotic real-time vision tasks that also suffer from token overload.

Load-bearing premise

That fusing tokens via similarity in 3D spatiotemporal coordinates and correcting positions with the weighted centroid will not remove or distort the spatial relationships needed for accurate robotic actions.

What would settle it

A controlled experiment showing a clear drop in success rate on a fine-grained manipulation task when ST-Merge is applied at 1024 by 1024 resolution compared with the unmerged baseline at identical resolution.
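Whether such an experiment shows a "clear drop" depends on how many episodes are run. The sketch below computes Wilson score intervals for binomial success rates and checks whether two intervals overlap. The episode counts are hypothetical, since the abstract does not state them, and comparing interval overlap is a conservative heuristic, not a formal hypothesis test.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts: 50 episodes per condition.
baseline = wilson_interval(45, 50)      # unmerged baseline
merged_close = wilson_interval(43, 50)  # merged, slightly lower success
merged_far = wilson_interval(25, 50)    # merged, much lower success

small_gap_unresolved = intervals_overlap(baseline, merged_close)
large_gap_resolved = not intervals_overlap(baseline, merged_far)
```

With 50 episodes per condition, a gap of two successes leaves the intervals overlapping, while a gap of twenty separates them. A claim of matched success rate is therefore only as strong as the episode count behind it.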

Figures

Figures reproduced from arXiv: 2606.29350 by Gang Zhou, Jindong Wang, Junzhou Chen.

Figure 1: ST-Merge resolves the VLA dilemma between …
Figure 2: Pipeline of ST-Merge, which can be seamlessly deployed within existing LVLMs. Inserted into shallow layer of …
Figure 3: Real-World setup. The image shows the SO-101 …
Figure 4: Qualitative results of real-world robotic manipulation on the SO-ARM101 [44] platform, using the …
Original abstract

Vision-language models and vision-language action models endow the robot with unprecedented capabilities. However, the input of video and high-resolution images yields a massive number of visual tokens, leading to extremely high inference latency and severely hindering the robot's real-time control. To break through this computational bottleneck, we propose ST-Merge, a plug-and-play, training-free framework that efficiently fuses redundant tokens directly during the visual encoding phase. By explicitly constructing 3D spatiotemporal coordinates, it employs a multi-queue parallel matching and weighted aggregation mechanism to achieve efficient and geometrically consistent fusion of redundant tokens across frames. In addition, we introduce a post-merge positional correction mechanism that effectively eliminates spatial deviation caused by merging by dynamically re-evaluating the rotational position code of the weighted centroid of the vision token, thereby ensuring the high-precision spatial awareness required for dexterous operation. In the Video Question Answering task on the mainstream VLM, Qwen2.5-VL, ST-Merge achieves a 2× inference speedup with only a tiny 1% loss in precision. When deployed on the π0.5 VLA policy, ST-Merge achieves an 8.3× speedup at 1024 × 1024 resolution and matches the baseline success rate at this high-resolution setting. At lower resolutions, it introduces a small drop in accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ST-Merge, a training-free, plug-and-play token-merging framework for VLMs and VLAs. It builds explicit 3D spatiotemporal coordinates, performs multi-queue parallel matching with weighted aggregation to fuse redundant visual tokens during encoding, and applies a post-merge positional correction that recomputes rotational position encodings from the weighted centroid of merged tokens. On Qwen2.5-VL video QA it reports 2× speedup with 1% precision loss; on the π0.5 VLA policy at 1024×1024 resolution it reports 8.3× speedup while matching baseline success rate (with a small accuracy drop at lower resolutions).

Significance. If the empirical claims and the spatial-preservation argument hold under rigorous verification, the method would directly address the latency barrier that currently prevents high-resolution visual input from being used in real-time robotic VLAs, offering a drop-in acceleration technique without retraining.

major comments (2)
  1. [Abstract / §4] Abstract and §4 (experimental claims): the headline results (8.3× speedup and matched success rate on π0.5 at 1024×1024; 2× speedup with 1% loss on Qwen2.5-VL) are stated without any description of baselines, number of trials, variance, error bars, or ablation studies. This absence prevents verification that the observed parity is attributable to the post-merge correction rather than task-specific tolerance.
  2. [Method (post-merge correction)] Method description (post-merge positional correction): the central claim that re-evaluating the rotational position code from the weighted centroid “eliminates spatial deviation” and preserves “high-precision spatial awareness required for dexterous operation” is load-bearing for the robotic application, yet the manuscript provides neither per-token position-error statistics nor an ablation that disables only the correction step. Without these, it is impossible to assess whether sub-pixel shifts introduced by averaging affect fine-grained control policies.
minor comments (2)
  1. [Method] Notation for the 3D coordinate construction and the multi-queue matching algorithm should be formalized with explicit equations rather than prose descriptions to allow reproducibility.
  2. [Abstract] The abstract states “matches the baseline success rate at this high-resolution setting” but does not specify the exact success-rate values or the number of evaluation episodes; these numbers should appear in the main text or a table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to provide the requested verification.

Point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (experimental claims): the headline results (8.3× speedup and matched success rate on π0.5 at 1024×1024; 2× speedup with 1% loss on Qwen2.5-VL) are stated without any description of baselines, number of trials, variance, error bars, or ablation studies. This absence prevents verification that the observed parity is attributable to the post-merge correction rather than task-specific tolerance.

    Authors: We agree that the current experimental reporting lacks these details. In the revised manuscript we will expand §4 to specify the baselines, number of trials, variance measures with error bars, and include an ablation isolating the post-merge correction to confirm its contribution to the observed parity.

    Revision: yes

  2. Referee: [Method (post-merge correction)] Method description (post-merge positional correction): the central claim that re-evaluating the rotational position code from the weighted centroid “eliminates spatial deviation” and preserves “high-precision spatial awareness required for dexterous operation” is load-bearing for the robotic application, yet the manuscript provides neither per-token position-error statistics nor an ablation that disables only the correction step. Without these, it is impossible to assess whether sub-pixel shifts introduced by averaging affect fine-grained control policies.

    Authors: We acknowledge that the manuscript currently lacks per-token position-error statistics and an ablation disabling only the correction. The revised version will add these: position-error metrics before/after correction and an ablation removing solely the post-merge step, to quantify any effect of averaging on spatial precision and policy performance.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no self-referential derivations

full rationale

The paper describes ST-Merge as a training-free algorithmic framework using explicit 3D spatiotemporal coordinate construction, multi-queue matching, weighted aggregation, and post-merge centroid correction. No equations, fitted parameters, or derivations appear in the abstract that reduce predictions to inputs by construction. Reported speedups and success rates are framed as direct empirical measurements on Qwen2.5-VL and π0.5, not as quantities defined from self-citations or ansatzes. No load-bearing self-citation chains or uniqueness theorems are invoked. The approach is self-contained as an engineering proposal evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or new invented entities; the method is described at the level of algorithmic components without derivations or postulated objects.

Reference graph

Works this paper leans on

44 extracted references · 18 canonical work pages · 11 internal anchors

  1. [1]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  2. [2]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

  4. [4]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-vl technical report,” arXiv preprint arXiv:2502.13923, 2025

  5. [5]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  6. [6]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024

  7. [7]

    Impedancegpt: Vlm-driven impedance control of swarm of mini-drones for intelligent navigation in dynamic environment

    F. Batool, Y. Yaqoot, M. Zafar, R. A. Khan, M. H. Khan, A. Fedoseev, and D. Tsetserukou, “Impedancegpt: Vlm-driven impedance control of swarm of mini-drones for intelligent navigation in dynamic environment,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 2592–2597

  8. [8]

    Rod-vlm: A framework of real-time robotic perception, reasoning and manipulation

    Y. Zhu, X. Wang, F. Yu, T. Lei, and Y. Sun, “Rod-vlm: A framework of real-time robotic perception, reasoning and manipulation,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 866–873

  9. [9]

    On the safety concerns of deploying llms/vlms in robotics: Highlighting the risks and vulnerabilities

    X. Wu, R. Xian, T. Guan, J. Liang, S. Chakraborty, F. Liu, B. M. Sadler, D. Manocha, and A. Bedi, “On the safety concerns of deploying llms/vlms in robotics: Highlighting the risks and vulnerabilities,” in First Vision and Language for Autonomous Driving and Robotics Workshop, 2024

  10. [10]

    OpenVLA: An Open-Source Vision-Language-Action Model

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi et al., “Openvla: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024

  11. [11]

    π0.5: a Vision-Language-Action Model with Open-World Generalization

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai et al., “π0.5: a vision-language-action model with open-world generalization,” arXiv preprint arXiv:2504.16054, 2025

  12. [12]

    Diffusion policy: Visuomotor policy learning via action diffusion

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

  13. [13]

    Video token merging for long-form video understanding

    S.-H. Lee, J. Wang, Z. Zhang, D. Fan, and X. Li, “Video token merging for long-form video understanding,” arXiv preprint arXiv:2410.23782, 2024

  14. [14]

    Pumer: Pruning and merging tokens for efficient vision language models

    Q. Cao, B. Paranjape, and H. Hajishirzi, “Pumer: Pruning and merging tokens for efficient vision language models,” arXiv preprint arXiv:2305.17530, 2023

  15. [15]

    Token Merging: Your ViT But Faster

    D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your vit but faster,” arXiv preprint arXiv:2210.09461, 2022

  16. [16]

    Boosting multimodal large language models with visual tokens withdrawal for rapid inference

    Z. Lin, M. Lin, L. Lin, and R. Ji, “Boosting multimodal large language models with visual tokens withdrawal for rapid inference,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 5334–5342

  17. [17]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    L. Chen, H. Zhao, T. Liu, S. Bai, J. Lin, C. Zhou, and B. Chang, “An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models,” in European Conference on Computer Vision. Springer, 2024, pp. 19–35

  18. [18]

    Framefusion: Combining similarity and importance for video token reduction on large vision language models

    T. Fu, T. Liu, Q. Han, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang, “Framefusion: Combining similarity and importance for video token reduction on large vision language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22 654–22 663

  19. [19]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 44 776–44 791, 2023

  20. [20]

    Visual instruction tuning

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” 2023

  21. [21]

    An advanced driving agent with the multimodal large language model for autonomous vehicles

    J. Chen and S. Lu, “An advanced driving agent with the multimodal large language model for autonomous vehicles,” in 2024 IEEE International Conference on Mobility, Operations, Services and Technologies (MOST). IEEE, 2024, pp. 1–11

  22. [22]

    Cliport: What and where pathways for robotic manipulation

    M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in Conference on robot learning. PMLR, 2022, pp. 894–906

  23. [23]

    Vima: General robot manipulation with multimodal prompts

    Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, “Vima: General robot manipulation with multimodal prompts,” in Fortieth International Conference on Machine Learning, 2023

  24. [24]

    PaLM-E: An Embodied Multimodal Language Model

    D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023

  25. [25]

    Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

    C. H. Song, V. Blukis, J. Tremblay, S. Tyree, Y. Su, and S. Birchfield, “Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15 768–15 780

  26. [26]

    Physvlm: Enabling visual language models to understand robotic physical reachability

    W. Zhou, M. Tao, C. Zhao, H. Guo, H. Dong, M. Tang, and J. Wang, “Physvlm: Enabling visual language models to understand robotic physical reachability,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6940–6949

  27. [27]

    Vlmpc: Vision-language model predictive control for robotic manipulation

    W. Zhao, J. Chen, Z. Meng, D. Mao, R. Song, and W. Zhang, “Vlmpc: Vision-language model predictive control for robotic manipulation,” arXiv preprint arXiv:2407.09829, 2024

  28. [28]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti et al., “Smolvla: A vision-language-action model for affordable and efficient robotics,” arXiv preprint arXiv:2506.01844, 2025

  29. [29]

    Drivelm: Driving with graph visual question answering

    C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European conference on computer vision. Springer, 2024, pp. 256–274

  30. [30]

    Tr-dq: Time-rotation diffusion quantization

    Y. Shao, D. Lin, F. Zeng, M. Yan, M. Zhang, S. Chen, Y. Fan, Z. Yan, H. Wang, J. Guo et al., “Tr-dq: Time-rotation diffusion quantization,” arXiv preprint arXiv:2503.06564, 2025

  31. [31]

    Learning to merge tokens in vision transformers

    C. Renggli, A. S. Pinto, N. Houlsby, B. Mustafa, J. Puigcerver, and C. Riquelme, “Learning to merge tokens in vision transformers,” arXiv preprint arXiv:2202.12015, 2022

  32. [32]

    Learned token pruning for transformers

    S. Kim, S. Shen, D. Thorsley, A. Gholami, W. Kwon, J. Hassoun, and K. Keutzer, “Learned token pruning for transformers,” in Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, 2022, pp. 784–794

  33. [33]

    Exploring token pruning in vision state space models

    Z. Zhan, Z. Kong, Y. Gong, Y. Wu, Z. Meng, H. Zheng, X. Shen, S. Ioannidis, W. Niu, P. Zhao et al., “Exploring token pruning in vision state space models,” Advances in Neural Information Processing Systems, vol. 37, pp. 50 952–50 971, 2024

  34. [34]

    Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model

    C. Yang, Y. Sui, J. Xiao, L. Huang, Y. Gong, C. Li, J. Yan, Y. Bai, P. Sadayappan, X. Hu et al., “Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19 803–19 813

  35. [35]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in neural information processing systems, vol. 35, pp. 16 344–16 359, 2022

  36. [36]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Y. Shang, M. Cai, B. Xu, Y. J. Lee, and Y. Yan, “Llava-prumerge: Adaptive token reduction for efficient large multimodal models,” arXiv preprint arXiv:2403.15388, 2024

  37. [37]

    Rotary position embedding for vision transformer

    B. Heo, S. Park, D. Han, and S. Yun, “Rotary position embedding for vision transformer,” in European Conference on Computer Vision. Springer, 2024, pp. 289–305

  38. [38]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day

    C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” Advances in Neural Information Processing Systems, vol. 36, pp. 28 541–28 564, 2023

  39. [39]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 24 185–24 198

  40. [40]

    Roformer: Enhanced transformer with rotary position embedding

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024

  41. [41]

    Tempme: Video temporal token merging for efficient text-video retrieval

    L. Shen, T. Hao, T. He, S. Zhao, Y. Zhang, P. Liu, Y. Bao, and G. Ding, “Tempme: Video temporal token merging for efficient text-video retrieval,” arXiv preprint arXiv:2409.01156, 2024

  42. [42]

    lerobot_π0.5_base

    LeRobot, “lerobot_π0.5_base,” https://huggingface.co/lerobot/pi05_base, 2026, Hugging Face model repository. Accessed: 2026-02-27

  43. [43]

    A shortcut-aware video-qa benchmark for physical understanding via minimal video pairs

    B. Krojer, M. Komeili, C. Ross, Q. Garrido, K. Sinha, N. Ballas, and M. Assran, “A shortcut-aware video-qa benchmark for physical understanding via minimal video pairs,” arXiv, 2025

  44. [44]

    Lerobot: State-of-the-art machine learning for real-world robotics in pytorch

    R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, P. Kooijmans, M. Aractingi, M. Shukor, D. Aubakirova, M. Russi, F. Capuano, C. Pascal, J. Choghari, J. Moss, and T. Wolf, “Lerobot: State-of-the-art machine learning for real-world robotics in pytorch,” https://github.com/huggingface/lerobot, 2024