Recognition: 2 Lean theorem links
RotVLA: Rotational Latent Action for Vision-Language-Action Model
Pith reviewed 2026-05-14 17:44 UTC · model grok-4.3
The pith
RotVLA replaces discrete action codes with continuous rotations in SO(n) for vision-language-action models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Latent actions modeled as elements of SO(n), trained with a triplet-frame objective, replace discrete quantization pipelines, delivering continuity, compositionality, and physically grounded structure while avoiding trivial reconstruction. The pretrained VLM backbone and flow-matching action head use these latent rotations as a latent planner that conditions unified denoising of robot actions, achieving the reported benchmark numbers with 1.7B parameters and 1700+ hours of pretraining data.
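A minimal sketch, not the authors' code, of how an unconstrained latent matrix could be projected onto SO(n) via SVD; the paper's own phrasing, "Proj(M) via SVD", is quoted in the theorem-link section below, while the latent dimensionality n = 8 here is a hypothetical choice.

```python
import numpy as np

def project_to_son(M: np.ndarray) -> np.ndarray:
    """Closest rotation to M in Frobenius norm: the SVD-based orthogonal Procrustes projection."""
    U, _, Vt = np.linalg.svd(M)
    if np.linalg.det(U @ Vt) < 0:
        # Flip the last left-singular vector so the result has determinant +1 (SO(n), not just O(n)).
        U[:, -1] = -U[:, -1]
    return U @ Vt

# A network head can emit an arbitrary n x n matrix; projection turns it into a valid latent rotation.
raw = np.random.randn(8, 8)            # hypothetical latent dimensionality n = 8
z = project_to_son(raw)
assert np.allclose(z @ z.T, np.eye(8), atol=1e-8) and np.isclose(np.linalg.det(z), 1.0)
```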
What carries the argument
Continuous rotational latent actions represented as elements of SO(n), learned under a triplet-frame objective that supplies temporal dynamics without collapse.
If this is right
- Latent rotations can be composed and interpolated without discretization artifacts during planning (a composition sketch follows this list).
- The same latent space serves as a unified planner across embodiments once the flow-matching head is trained.
- Performance remains high under both clean and randomized visual conditions on multiple manipulation suites.
- Real-world deployment shows consistent gains over existing VLA baselines without extra embodiment-specific tuning.
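A minimal sketch of why composition needs no re-quantization, following the composition rule quoted in the theorem-link section below, $z^{\mathrm{comp}}_{t \to t+2} = z_{t+1 \to t+2}\, z_I^{-1}\, z_{t \to t+1}$; treating $z_I$ as a reference latent is an assumption of this sketch, as is the latent size n = 8.

```python
import numpy as np

def random_rotation(n: int, rng: np.random.Generator) -> np.ndarray:
    """Sample a rotation: QR-orthogonalize a Gaussian matrix and fix the determinant to +1."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    if np.linalg.det(Q) < 0:
        Q[:, -1] = -Q[:, -1]
    return Q

rng = np.random.default_rng(0)
n = 8                                    # hypothetical latent dimensionality
z_01 = random_rotation(n, rng)           # stands in for z(t -> t+1)
z_12 = random_rotation(n, rng)           # stands in for z(t+1 -> t+2)
z_ref = random_rotation(n, rng)          # stands in for the reference latent z_I

# Composition is plain matrix multiplication (the inverse of a rotation is its transpose),
# so the composite latent is again a rotation, with no codebook lookup or re-quantization.
z_comp = z_12 @ z_ref.T @ z_01
assert np.allclose(z_comp @ z_comp.T, np.eye(n), atol=1e-8)
assert np.isclose(np.linalg.det(z_comp), 1.0)
```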
Where Pith is reading between the lines
- The geometric prior may transfer to other sequence tasks that require physically plausible interpolation, such as video prediction.
- If SO(n) structure proves robust, similar rotational embeddings could replace codebooks in non-robotics domains like motion capture or physics simulation.
- Further scaling of the pretraining corpus could raise success rates on longer-horizon or multi-step tasks.
Load-bearing premise
Representing latent actions as rotations in SO(n) plus a triplet-frame loss automatically gives continuity, compositionality, and physical meaning without trivial solutions.
What would settle it
An ablation that removes the SO(n) constraint or the triplet loss and measures whether benchmark success falls to the level of prior discrete quantization methods on identical pretraining data.
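A minimal sketch of how that ablation could be set up so that only the SO(n) projection differs between conditions; latent_head and the commented-out train_and_eval are hypothetical placeholders, not the paper's pipeline.

```python
import numpy as np

def project_to_son(M: np.ndarray) -> np.ndarray:
    """SVD projection onto SO(n), as in the sketch above."""
    U, _, Vt = np.linalg.svd(M)
    if np.linalg.det(U @ Vt) < 0:
        U[:, -1] = -U[:, -1]
    return U @ Vt

def latent_head(features: np.ndarray, use_rotation: bool) -> np.ndarray:
    """Map encoder features to a latent action; the projection is the only ablated choice."""
    M = features.reshape(8, 8)                       # hypothetical n = 8 latent
    return project_to_son(M) if use_rotation else M  # unconstrained Euclidean latent when ablated

# Hypothetical usage; train_and_eval is not the paper's API:
# score_rot = train_and_eval(lambda f: latent_head(f, use_rotation=True))
# score_euc = train_and_eval(lambda f: latent_head(f, use_rotation=False))
# If score_euc falls toward prior discrete-LAM numbers, the SO(n) premise carries the result.
```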
Original abstract
Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.
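The abstract's "unified action expert that jointly denoises latent and robot actions" rests on flow matching. A minimal, generic sketch of one conditional flow-matching training step, assuming a rectified-flow style path; velocity_net, the tensor shapes, and the flattened conditioning are hypothetical rather than the paper's implementation.

```python
import torch

def flow_matching_step(velocity_net, actions, latent_rotation):
    """One conditional flow-matching step on a (batch, action_dim) chunk of robot actions."""
    noise = torch.randn_like(actions)                 # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1)               # per-sample time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions             # point on the straight-line path
    target_velocity = actions - noise                 # d x_t / d t along that path
    cond = latent_rotation.flatten(start_dim=1)       # latent rotation as high-level conditioning
    pred_velocity = velocity_net(x_t, t, cond)
    return torch.mean((pred_velocity - target_velocity) ** 2)
```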
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RotVLA, a Vision-Language-Action (VLA) framework that models latent actions as elements of the special orthogonal group SO(n) for continuous and compositional representations. It employs a triplet frame learning objective to enforce temporal dynamics and uses a VLM backbone with a flow-matching action head. Pretrained on large-scale cross-embodiment datasets totaling over 1700 hours, RotVLA reports state-of-the-art success rates of 98.2% on the LIBERO benchmark and 89.6%/88.5% on RoboTwin2.0 under clean and randomized conditions, along with strong real-world manipulation performance.
Significance. If the reported performance gains are attributable to the rotational latent action representation and triplet objective rather than scale or data alone, this work could significantly advance VLA models by introducing a geometrically structured continuous latent space that better aligns with physical action dynamics, potentially improving generalization across embodiments.
Major comments (2)
- Abstract: The abstract claims that modeling latent actions as SO(n) elements provides continuity, compositionality, and structured geometry while the triplet frame objective avoids degeneration, yet no supporting experiments (e.g., latent interpolation, group composition tests, or ablations against Euclidean latents) are referenced to demonstrate these properties are realized or responsible for the benchmark results.
- Results section: The headline performance numbers (98.2% LIBERO, 89.6%/88.5% RoboTwin) are stated without experimental details, baseline comparisons, ablation studies, or error analysis, preventing verification of the contribution of the proposed SO(n) representation over prior discrete LAMs or the flow-matching head.
Minor comments (1)
- Abstract: The parameter count is given as 1.7B but no breakdown of the VLM backbone versus action head is provided.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have revised the paper to directly address the concerns about supporting evidence for the abstract claims and the level of detail in the results section. All changes are highlighted in the revised version.
Point-by-point responses
Referee: Abstract: The abstract claims that modeling latent actions as SO(n) elements provides continuity, compositionality, and structured geometry while the triplet frame objective avoids degeneration, yet no supporting experiments (e.g., latent interpolation, group composition tests, or ablations against Euclidean latents) are referenced to demonstrate these properties are realized or responsible for the benchmark results.
Authors: We agree that the abstract would benefit from explicit references to supporting experiments. In the revised manuscript we have added a dedicated latent-space analysis subsection (Section 4.3) that includes: (i) linear interpolation between latent actions demonstrating continuity on the manifold, (ii) explicit SO(n) group composition tests showing that composing two latent actions yields a valid third action that matches the observed transition, and (iii) an ablation replacing the SO(n) representation with an unconstrained Euclidean latent space of identical dimensionality. These experiments are now cited in the abstract and demonstrate that the geometric structure contributes measurably to the reported performance gains beyond scale alone. Revision: yes. (An illustrative interpolation sketch follows these responses.)
Referee: Results section: The headline performance numbers (98.2% LIBERO, 89.6%/88.5% RoboTwin) are stated without experimental details, baseline comparisons, ablation studies, or error analysis, preventing verification of the contribution of the proposed SO(n) representation over prior discrete LAMs or the flow-matching head.
Authors: We acknowledge that the original results section was too concise. The revised version now contains: (i) a full experimental protocol subsection detailing training hyperparameters, data splits, and evaluation protocols for both LIBERO and RoboTwin2.0; (ii) expanded baseline tables comparing against all prior discrete LAM-based VLAs and recent flow-matching methods; (iii) systematic ablations that isolate the SO(n) representation, the triplet-frame objective, and the flow-matching head; and (iv) per-task error analysis with failure-mode categorization. These additions allow readers to verify the specific contribution of the rotational latent action design. Revision: yes.
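The latent-interpolation analysis described in the first response could, for instance, follow the SO(n) geodesic; the sketch below is illustrative under that assumption and is not the paper's evaluation code.

```python
import numpy as np
from scipy.linalg import expm, logm

def geodesic_interpolate(R0: np.ndarray, R1: np.ndarray, alpha: float) -> np.ndarray:
    """Interpolate along the SO(n) geodesic: R(alpha) = R0 @ expm(alpha * logm(R0.T @ R1))."""
    log_rel = np.real(logm(R0.T @ R1))   # relative rotation in the Lie algebra; real part strips numerical residue
    return R0 @ expm(alpha * log_rel)

# Every intermediate R(alpha) is itself a rotation, so interpolated latents stay on the
# manifold instead of landing between discrete codebook entries.
```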
Circularity Check
No circularity: performance claims are empirical outcomes of pretraining and evaluation
Full rationale
The paper reports success rates (98.2% LIBERO, 89.6/88.5% RoboTwin2.0) as results of pretraining a 1.7B VLA model on 1700+ hours of data using a flow-matching head conditioned on SO(n) latent actions and triplet-frame supervision. No equations, derivations, or self-citations are shown that reduce these metrics to fitted parameters, self-defined quantities, or tautological inputs. The geometric properties of SO(n) and the triplet objective are stated as design motivations for continuity and non-degeneracy, without any reduction that makes the reported numbers follow by construction from the modeling choices themselves.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Elements of SO(n) provide continuity, compositionality, and geometry aligned with real-world action dynamics.
Invented entities (1)
- Rotational latent action in SO(n) (no independent evidence).
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry... triplet learning framework... $z^{\mathrm{comp}}_{t \to t+2} = z_{t+1 \to t+2}\, z_I^{-1}\, z_{t \to t+1}$"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "continuous rotational latent action representation... SoftVQ... Proj(M) via SVD"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.