arxiv: 2605.18727 · v1 · pith:BTPPXIRKnew · submitted 2026-05-18 · 💻 cs.RO · cs.AI

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

Pith reviewed 2026-05-20 09:34 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords dexterous manipulation · embodied robotics · Texas Hold'em · benchmark · ShadowHand · agentic perception · policy evaluation · physical tabletop tasks

The pith

DexHoldem provides a physical benchmark that tests whether embodied agents can perceive, decide, and dexterously manipulate cards through a full Texas Hold'em game loop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DexHoldem as a real-world benchmark for dexterous embodied systems built around Texas Hold'em with a ShadowHand. It supplies 1,470 teleoperated demonstrations across 14 manipulation primitives plus standardized tests for policy execution and agentic perception of game state. A sympathetic reader would care because the benchmark measures the complete cycle of perceiving a changing tabletop, selecting a context-appropriate action, executing it with a dexterous hand, and leaving the scene usable for later moves. Case studies then show how perception and policy errors build up during actual closed-loop runs that include waiting, recovery, and human-help requests.

Core claim

DexHoldem evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. The best policy reaches 61.2 percent task completion and 47.5 percent scene-preserving success on the primitives. The strongest perception model attains 34.3 percent strict problem-level accuracy while reaching 66.8 percent average field-wise accuracy, revealing a gap between isolated visual capabilities and complete state recovery needed for routing. Three case studies instantiate the full embodied loop to demonstrate error accumulation across repeated primitive executions.
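The gap between the two perception scores follows from how they aggregate. The sketch below is illustrative only: it assumes each perception problem is a set of named fields, that field-wise accuracy averages per-field correctness, and that strict accuracy requires every field of a problem to match. The field names and toy data are hypothetical and are not the paper's schema.

```python
from typing import Dict, List

State = Dict[str, str]  # hypothetical: field name -> value


def fieldwise_accuracy(preds: List[State], golds: List[State]) -> float:
    """Average, over fields, of the fraction of problems with that field correct."""
    fields = list(golds[0].keys())
    per_field = [
        sum(p.get(f) == g[f] for p, g in zip(preds, golds)) / len(golds)
        for f in fields
    ]
    return sum(per_field) / len(per_field)


def strict_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of problems in which every field is correct."""
    hits = sum(all(p.get(f) == g[f] for f in g) for p, g in zip(preds, golds))
    return hits / len(golds)


# Toy data (hypothetical): three problems, three fields each.
golds = [
    {"left_card": "Ah", "pot": "20", "robot_idle": "yes"},
    {"left_card": "7c", "pot": "40", "robot_idle": "yes"},
    {"left_card": "Kd", "pot": "60", "robot_idle": "no"},
]
preds = [
    {"left_card": "Ah", "pot": "20", "robot_idle": "yes"},  # all fields correct
    {"left_card": "7c", "pot": "50", "robot_idle": "yes"},  # pot wrong
    {"left_card": "Qd", "pot": "60", "robot_idle": "no"},   # left_card wrong
]

field_score = fieldwise_accuracy(preds, golds)
strict_score = strict_accuracy(preds, golds)
```

In this toy case seven of nine fields are right, yet only one of three problems is fully recovered, which is the same pattern as the reported 66.8 versus 34.3 percent: a single wrong field is enough to invalidate a state used for routing.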

What carries the argument

The DexHoldem benchmark, which supplies demonstrations for 14 Texas Hold'em primitives, runs standardized policy and agentic-perception evaluations on physical hardware, and closes the loop with waiting, recovery, and help-request behaviors.
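One way to picture the closed loop is as a small routing function over a parsed game state. This is a minimal sketch under stated assumptions, not the authors' implementation: the `GameState` fields, the `Route` labels, and the order of the checks are all hypothetical. Only the primitive name `view_card(L)` is taken from the Figure 2 caption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Route(Enum):
    WAIT = "wait"
    RECOVER = "recover"
    ASK_HUMAN = "ask_human"
    DISPATCH = "dispatch"


@dataclass
class GameState:
    # All fields are hypothetical stand-ins for the structured game-state memory.
    scene_stable: bool
    robot_idle: bool
    state_confident: bool
    last_primitive_failed: bool
    next_primitive: Optional[str]


def route(state: GameState) -> Tuple[Route, Optional[str]]:
    """Pick one routing outcome for a single decision step (assumed check order)."""
    if not state.scene_stable or not state.robot_idle:
        return Route.WAIT, None          # scene still changing or robot busy
    if not state.state_confident:
        return Route.ASK_HUMAN, None     # state cannot be recovered reliably
    if state.last_primitive_failed:
        return Route.RECOVER, None       # dispatch a recovery behavior
    if state.next_primitive is None:
        return Route.WAIT, None          # nothing executable is needed
    return Route.DISPATCH, state.next_primitive


decision = route(
    GameState(
        scene_stable=True,
        robot_idle=True,
        state_confident=True,
        last_primitive_failed=False,
        next_primitive="view_card(L)",
    )
)
```

Writing the router as a pure function of the parsed state makes it easy to see why strict state recovery matters: any wrong field can flip the branch taken.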

If this is right

  • Policies achieve at most 61.2 percent task completion and 47.5 percent scene-preserving success on the 14 primitives.
  • Perception models exhibit a large gap between 66.8 percent field-wise accuracy and 34.3 percent strict game-state accuracy required for decision routing.
  • Closed-loop deployments reveal compounding errors across perception, policy, and repeated primitive executions.
  • The benchmark explicitly supports testing of recovery dispatches, human-help requests, and scene-maintenance behaviors.
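For a rough sense of how such errors compound, suppose each primitive in a sequence succeeds independently with the same probability. This is a simplifying assumption made here for illustration, not one the paper states. The chance of a clean run then decays geometrically with sequence length.

```python
def clean_run_probability(p_step: float, n_steps: int) -> float:
    """Probability that n independent steps all succeed, each with probability p_step."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must lie in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


# Using the reported 47.5 percent scene-preserving success as an illustrative per-step rate.
three_step = clean_run_probability(0.475, 3)
five_step = clean_run_probability(0.475, 5)
```

Under this assumption a three-primitive sequence finishes cleanly only about 11 percent of the time, which is why waiting, recovery, and help-request behaviors matter in the closed-loop case studies.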

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structured game-state recovery tasks could be applied to other multi-step tabletop activities to test generality beyond poker.
  • Directly feeding perception outputs into policy inputs might shrink the observed error accumulation across full loops.
  • Repeating the evaluations on different dexterous hardware would show whether the performance numbers are specific to the ShadowHand.
  • Extending runs to complete multi-hand games would expose whether the current primitives scale to longer sequences.

Load-bearing premise

The 14 chosen Texas Hold'em manipulation primitives and the defined agentic perception tasks are representative of the core challenges faced by embodied agents in dynamic, multi-step physical scenes.

What would settle it

A new policy or perception model run on the identical physical DexHoldem setup that exceeds 61.2 percent task completion or 34.3 percent strict accuracy would show that the reported figures are current limits rather than fundamental ceilings.

Figures

Figures reproduced from arXiv: 2605.18727 by Feng Chen, Li Sun, Pei Zhou, Shenghua Gao, Tianzhe Chu, Yanchao Yang, Yi Ma, Yuexiang Zhai, Zhuxiu Xu.

Figure 1: Overview of DexHoldem, a real-world Texas Hold’em benchmark for dexterous manipulation. (a) The setup uses a ShadowHand with top-down, third-person, and wrist-mounted cameras for card and chip manipulation. (b) The system closes the loop by parsing observations into game state, routing instructions, and executing policies. (c,d) Policy and agent benchmarks show that current models still struggle with cont…
Figure 2: One decision step of the DexHoldem embodied agent. The agent perceives the tabletop, loads and renews structured game-state memory, routes the state through reasoning checks, and dispatches a dexterous policy when the scene is stable and an executable primitive is needed. In the illustrated step, an unknown left card with the robot idle routes to the agent primitive view_card(L), which translates to the de…
Figure 3: Final validation loss for the RDT fine-tuning data-scaling probe. Random and pretrained initializations follow similar data-scaling trends. Error bars denote one standard deviation over three completed paired seeds. We use RDT as a representative policy instantiation to probe how much DexHoldem-specific dexterous-hand data is needed to reliably fit the target action distribution, using held-out action-pr…
Figure 4: Some examples from the DexHoldem policy benchmark. … .npy arrays. The loader converts dict-valued joint records into the canonical 30-dimensional joint order, constructs 100 training trajectories and 5 validation trajectories per primitive, and exposes a common batch containing RGB/depth observations, optional precomputed RGB features, normalized joint-position proprioception, the instruction ID, and a 30-di…
Figure 5: Policy pretraining data scale, policy size, and physical task completion rate on DexHoldem.
Figure 6: Train-time validation-loss curves for the representative RDT fine-tuning study across …
Figure 7: Simulation check through real-to-sim trajectory replay under the three camera views used …
Figure 8: Representative top-down replay sequence in the reconstructed simulation environment.
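The Figure 4 caption describes a loader that converts dict-valued joint records into a canonical 30-dimensional joint order and builds 100 training plus 5 validation trajectories per primitive. The sketch below shows what such a conversion might look like. The joint names are placeholders (the real names and ordering are not given here), and the function is a hypothetical helper, not part of the released code.

```python
from typing import Dict, List, Sequence

# Hypothetical canonical order; the benchmark's actual joint names are not given here.
CANONICAL_JOINTS: List[str] = [f"joint_{i:02d}" for i in range(30)]


def joints_to_vector(
    record: Dict[str, float], order: Sequence[str] = CANONICAL_JOINTS
) -> List[float]:
    """Flatten a dict-valued joint record into a fixed-order list of floats."""
    missing = [name for name in order if name not in record]
    if missing:
        raise KeyError(f"missing joints: {missing}")
    return [float(record[name]) for name in order]


# A record whose keys arrive in arbitrary (here reversed) order.
record = {name: float(i) for i, name in enumerate(reversed(CANONICAL_JOINTS))}
vector = joints_to_vector(record)

# Split sizes from the caption: 14 primitives x (100 train + 5 validation).
total_trajectories = 14 * (100 + 5)
```

The split sizes multiply out to 1,470 trajectories, which is consistent with the demonstration count reported for the benchmark.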
Original abstract

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, $\pi_{0.5}$ obtains the highest task completion rate ($61.2\%$), while $\pi_{0.5}$ and $\pi_0$ tie on scene-preserving success rate ($47.5\%$). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy ($34.3\%$), while GPT 5.5 obtains the best average field-wise accuracy ($66.8\%$), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DexHoldem, a real-world system-level benchmark for dexterous embodied agents centered on Texas Hold'em card manipulation using a ShadowHand. It contributes 1,470 teleoperated demonstrations across 14 manipulation primitives, a standardized physical policy benchmark reporting task completion and scene-preservation metrics, an agentic perception benchmark measuring recovery of structured game state, and three closed-loop case studies illustrating error accumulation in perception-policy loops.

Significance. If the benchmark's scope is validated, DexHoldem offers a concrete, hardware-grounded testbed that integrates perception, decision routing, and dexterous execution in a shared physical setting, exposing gaps between isolated visual capabilities and complete state recovery needed for embodied decisions. The physical experiments, teleoperated demonstration collection, and explicit case studies on waiting/recovery/human-help behaviors provide reproducible empirical grounding that strengthens claims about real-world deployment challenges.

major comments (1)
  1. [§3] §3 (Benchmark Design): The 14 Texas Hold'em manipulation primitives are introduced as representative of core dexterous tabletop challenges without ablations against alternative tasks, explicit transfer experiments, or quantitative justification that success on these primitives predicts performance on broader multi-step dynamic scenes; this modeling choice is load-bearing for the central claim that DexHoldem evaluates dexterous execution and embodied decision routing in representative physical settings.
minor comments (2)
  1. [Figure 4] Figure 4 and §5.2: The perception accuracy tables would benefit from clearer error bars or per-run variance to allow readers to assess stability of the reported 34.3% strict accuracy and 66.8% field-wise accuracy.
  2. [§6] §6 (Case Studies): The three closed-loop examples are described qualitatively; adding quantitative metrics on error propagation rates across the full loop would strengthen the illustration of how perception and policy errors accumulate.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential of DexHoldem as a hardware-grounded benchmark integrating perception, policy, and dexterous execution. We address the single major comment on benchmark design below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Design): The 14 Texas Hold'em manipulation primitives are introduced as representative of core dexterous tabletop challenges without ablations against alternative tasks, explicit transfer experiments, or quantitative justification that success on these primitives predicts performance on broader multi-step dynamic scenes; this modeling choice is load-bearing for the central claim that DexHoldem evaluates dexterous execution and embodied decision routing in representative physical settings.

    Authors: We appreciate the referee's point that the selection of the 14 primitives is central to our claims. These primitives were systematically derived from the rules and typical flow of Texas Hold'em, covering the full range of required physical interactions: deck dealing, card flipping and revealing, chip pushing and stacking, and community-card organization. The set was refined through consultation with professional dealers and prior dexterous-manipulation literature to ensure coverage of contact-rich, precision, and in-hand skills that appear in real gameplay. We agree that ablations against alternative task sets or explicit transfer studies would further strengthen generalizability arguments; however, the primary goal of this work is to release a reproducible, domain-specific benchmark rather than to optimize or validate a universal task taxonomy. In the revised manuscript we have added a dedicated paragraph in §3 that (i) lists the explicit mapping from each primitive to core dexterous challenges, (ii) provides frequency estimates drawn from recorded poker sessions, and (iii) references established manipulation taxonomies to supply the requested quantitative grounding. This addition directly supports the claim that success on these primitives is indicative of performance in the broader multi-step physical scenes that constitute the benchmark.

    Revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain

The paper introduces DexHoldem as a physical benchmark consisting of teleoperated demonstrations, policy evaluations on hardware, and perception model tests. All reported numbers (61.2% task completion, 34.3% strict accuracy, etc.) are direct empirical measurements from runs on the defined primitives and tasks. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs; the 14 primitives and game-state recovery tasks are explicitly chosen modeling decisions rather than derived results. No self-citation load-bearing steps or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The benchmark rests on standard robotics domain assumptions about teleoperation data and hardware fidelity rather than new physical postulates or many fitted parameters.

free parameters (1)
  • Selection of 14 primitives
    The specific set of 14 Texas Hold'em actions is chosen to cover the game but constitutes an ad-hoc modeling decision.
axioms (1)
  • Domain assumption: Teleoperated human demonstrations provide a suitable reference distribution for evaluating robot policy performance on the same primitives.
    Invoked when using the 1,470 demonstrations as the basis for the policy benchmark.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 20 internal anchors

  1. [1]

    Robel: Robotics benchmarks for learning with low-cost robots

    Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine, and Vikash Kumar. Robel: Robotics benchmarks for learning with low-cost robots. InConfer- ence on robot learning, pp. 1300–1313. PMLR, 2020

  2. [2]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

  3. [3]

    Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

    OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob Mc- Grew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

  4. [4]

    Dexart: Benchmarking generalizable dexterous manipulation with articulated objects, 2023

    Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects, 2023. URL https://arxiv.org/abs/2305. 05706

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. URL https: //arxiv.org/abs/2410.24164

  6. [6]

    Rt-1: Robotics transformer for real-world control at scale, 2023

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, ...

  7. [7]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  8. [8]

    Solving challenging dexterous manipulation tasks with trajectory optimisation and reinforcement learning, 2021

    Henry Charlesworth and Giovanni Montana. Solving challenging dexterous manipulation tasks with trajectory optimisation and reinforcement learning, 2021. URL https://arxiv.org/ abs/2009.05104

  9. [9]

    Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization.arXiv preprint arXiv:2412.16490, 2024

    Jiayi Chen, Yubin Ke, and He Wang. Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization.arXiv preprint arXiv:2412.16490, 2024

  10. [10]

    Visual dexterity: In-hand reorientation of novel and complex object shapes.Science Robotics, 8 (84):eadc9244, 2023

    Tao Chen, Megha Tippur, Siyang Wu, Vikash Kumar, Edward Adelson, and Pulkit Agrawal. Visual dexterity: In-hand reorientation of novel and complex object shapes.Science Robotics, 8 (84):eadc9244, 2023

  11. [11]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalable d...

  12. [12]

    A Simple Framework for Contrastive Learning of Visual Representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020. URL https://arxiv.org/abs/ 2002.05709. 10

  13. [13]

    Towards human-level bimanual dexterous manipulation with reinforcement learning

    Yuanpei Chen, Yaodong Yang, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Zongqing Lu, Stephen Marcus McAleer, Hao Dong, and Song-Chun Zhu. Towards human-level bimanual dexterous manipulation with reinforcement learning. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview...

  14. [14]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems (RSS), 2023

  15. [15]

    Benchmarking in-hand manipulation.IEEE Robotics and Automation Letters, 5(2):588–595, April 2020

    Silvia Cruciani, Balakumar Sundaralingam, Kaiyu Hang, Vikash Kumar, Tucker Hermans, and Danica Kragic. Benchmarking in-hand manipulation.IEEE Robotics and Automation Letters, 5(2):588–595, April 2020. ISSN 2377-3774. doi: 10.1109/lra.2020.2964160. URL http://dx.doi.org/10.1109/LRA.2020.2964160

  16. [16]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

  17. [17]

    D4rl: Datasets for deep data-driven reinforcement learning, 2021

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning, 2021. URL https://arxiv.org/abs/2004. 07219

  18. [18]

    Maniskill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. Maniskill2: A unified benchmark for generalizable manipulation skills, 2023. URL https://arxiv.org/abs/2302.04659

  19. [19]

    Baku: An efficient transformer for multi-task policy learning, 2024

    Siddhant Haldar, Zhuoran Peng, and Lerrel Pinto. Baku: An efficient transformer for multi-task policy learning, 2024. URLhttps://arxiv.org/abs/2406.07539

  20. [20]

    Scaling Laws for Transfer

    Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer, 2021. URLhttps://arxiv.org/abs/2102.01293

  21. [21]

    Inner monologue: Embodied reasoning through planning with language models, 2022

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022

  22. [22]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023

  23. [23]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

  24. [24]

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. Rlbench: The robot learning benchmark & learning environment, 2019. URL https://arxiv.org/abs/1909. 12271

  25. [25]

    Vima: General robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts. InFortieth International Conference on Machine Learning, 2023. 11

  26. [26]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 16923–16930. IEEE, 2025

  27. [27]

    Momagraph: State-aware unified scene graphs with vision-language model for embodied task planning.International Conference on Learning Representations (ICLR) Oral, 2026

    Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, and Koushil Sreenath. Momagraph: State-aware unified scene graphs with vision-language model for embodied task planning.International Conference on Learning Representations (ICLR) Oral, 2026

  28. [28]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  29. [29]

    Big Transfer (BiT): General Visual Repre- sentation Learning, 2020

    Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning, 2020. URL https://arxiv.org/abs/1912.11370

  30. [30]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

  31. [31]

    Robohive – a unified framework for robot learning

    Vikash Kumar, Rutav Shah, Gaoyue Zhou, Vincent Moens, Vittorio Caggiano, Jay Vakil, Abhishek Gupta, and Aravind Rajeswaran. Robohive – a unified framework for robot learning. InNeurIPS: Conference on Neural Information Processing Systems, 2023. URL https: //sites.google.com/view/robohive

  32. [32]

    Code as Policies: Language Model Programs for Embodied Control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control, 2023. URL https://arxiv.org/abs/2209.07753

  33. [33]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https: //arxiv.org/abs/2306.03310

  34. [34]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation, 2025. URLhttps://arxiv.org/abs/2410.07864

  35. [35]

    Realdex: Towards human-like grasping for robotic dexterous hand,

    Yumeng Liu, Yaxun Yang, Youzhuo Wang, Xiaofei Wu, Jiamin Wang, Yichen Yao, Sören Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu, and Yuexin Ma. Realdex: Towards human-like grasping for robotic dexterous hand.arXiv preprint arXiv:2402.13853, 2024. URL https://arxiv.org/abs/2402.13853

  36. [36]

    MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations.arXiv preprint arXiv:2310.17596, 2023

  37. [37]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks, 2022

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks, 2022. URL https://arxiv.org/abs/2112.03227

  38. [38]

    Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations, 2021

    Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations, 2021

  39. [39]

    Robotwin: Dual-arm robot benchmark with generative digital twins (early version), 2025

    Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins (early version), 2025. URLhttps://arxiv.org/abs/2409.02920

  40. [40]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots, 2024. URLhttps://arxiv.org/abs/2406.02523. 12

  41. [41]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yun- liang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science a...

  42. [42]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  43. [43]

    Virtualhome: Simulating household activities via programs

    Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 8494–8502, 2018

  44. [44]

    Dexpoint: General- izable point cloud reinforcement learning for sim-to-real dexterous manipulation

    Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Hao Su, and Xiaolong Wang. Dexpoint: General- izable point cloud reinforcement learning for sim-to-real dexterous manipulation. InConference on Robot Learning, pp. 594–605. PMLR, 2023

  45. [45]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  46. [46]

    Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

    Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.arXiv preprint arXiv:1709.10087, 2017

  47. [47]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF international conference on computer vision, pp. 9339–9347, 2019

  48. [48]

    Shadow Robot Company, 2025

    Shadow Robot Company.Shadow Dexterous Hand - Technical Specification. Shadow Robot Company, 2025. URL https://shadowrobot.com/wp-content/uploads/2025/ 09/shadow_dexterous_hand_e_technical_specification.pdf

  49. [49]

    Shadow teleoperation system: Technical specification

    Shadow Robot Company. Shadow teleoperation system: Technical specification, September 2025. URL https://shadowrobot.com/wp-content/uploads/2025/09/shadow_teleop_technical_specification.pdf. Technical specification

  50. [50]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10740–10749, 2020

  51. [51]

    Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments

    Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, C. Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments, 2021

  52. [52]

    Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse kai Chan, Yuan Gao, Xuanlin Li, Tongzhou Mu, Nan Xiao, Arnav Gurha, Viswesh Nagaswamy Rajesh, Yong Woo Choi, Yen-Ru Chen, Zhiao Huang, Roberto Calandra, Rui Chen, Shan Luo, and Hao Su. Maniskill3: Gpu parallelized robotics simulation and r...

  53. [53]

    Fast-grasp’d: Dexterous multi-finger grasp generation through differentiable simulation

    Dylan Turpin, Tao Zhong, Shutong Zhang, Guanglei Zhu, Jingzhou Liu, Ritvik Singh, Eric Heiden, Miles Macklin, Stavros Tsogkas, Sven Dickinson, and Animesh Garg. Fast-grasp’d: Dexterous multi-finger grasp generation through differentiable simulation, 2023. URL https://arxiv.org/abs/2306.08132

  54. [54]

    Enact: Evaluating embodied cognition with world modeling of egocentric interaction

    Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Fei-Fei Li, and Manling Li. Enact: Evaluating embodied cognition with world modeling of egocentric interaction. arXiv preprint arXiv:2511.20937, 2025

  55. [55]

    Dexh2r: A benchmark for dynamic dexterous grasping in human-to-robot handover

    Youzhuo Wang, Jiayi Ye, Chuyang Xiao, Yiming Zhong, Heng Tao, Hang Yu, Yumeng Liu, Jingyi Yu, and Yuexin Ma. Dexh2r: A benchmark for dynamic dexterous grasping in human-to-robot handover, 2025. URL https://arxiv.org/abs/2506.23152

  56. [56]

    EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

    Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, et al. Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv preprint arXiv:2502.09560, 2025

  57. [57]

    Dex1b: Learning with 1b demonstrations for dexterous manipulation

    Jianglong Ye, Keyi Wang, Chengjing Yuan, Ruihan Yang, Yiquan Li, Jiyue Zhu, Yuzhe Qin, Xueyan Zou, and Xiaolong Wang. Dex1b: Learning with 1b demonstrations for dexterous manipulation, 2025. URL https://arxiv.org/abs/2506.17198

  58. [58]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pp. 1094–1100. PMLR, 2020

  59. [59]

    Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes

    Jialiang Zhang, Haoran Liu, Danshi Li, Xinqiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes, 2024. URL https://arxiv.org/abs/2410.23004

  60. [60]

    Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks

    Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks, 2024. URL https://arxiv.org/abs/2412.18194

  61. [61]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  62. [62]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pp. 2165–2183. PMLR, 2023.

A Author Contributions

Feng Co-proposed and led the project; designed the data-collec...