DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

Feng Chen; Li Sun; Pei Zhou; Shenghua Gao; Tianzhe Chu; Yanchao Yang; Yi Ma; Yuexiang Zhai; Zhuxiu Xu

arxiv: 2605.18727 · v1 · pith:BTPPXIRKnew · submitted 2026-05-18 · 💻 cs.RO · cs.AI

Pith reviewed 2026-05-20 09:34 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords: dexterous manipulation, embodied robotics, Texas Hold'em, benchmark, ShadowHand, agentic perception, policy evaluation, physical tabletop tasks

The pith

DexHoldem provides a physical benchmark that tests whether embodied agents can perceive, decide, and dexterously manipulate cards through a full Texas Hold'em game loop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DexHoldem as a real-world benchmark for dexterous embodied systems built around Texas Hold'em with a ShadowHand. It supplies 1,470 teleoperated demonstrations across 14 manipulation primitives plus standardized tests for policy execution and agentic perception of game state. A sympathetic reader would care because the benchmark measures the complete cycle of perceiving a changing tabletop, selecting a context-appropriate action, executing it with a dexterous hand, and leaving the scene usable for later moves. Case studies then show how perception and policy errors build up during actual closed-loop runs that include waiting, recovery, and human-help requests.

Core claim

DexHoldem evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. The best policy, $\pi_{0.5}$, reaches 61.2 percent task completion and 47.5 percent scene-preserving success on the primitives, tying $\pi_0$ on the latter. On perception, the best strict problem-level accuracy is 34.3 percent (Opus 4.7) and the best average field-wise accuracy is 66.8 percent (GPT 5.5), revealing a gap between isolated visual capabilities and the complete state recovery needed for routing. Three case studies instantiate the full embodied loop to demonstrate error accumulation across repeated primitive executions.
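
To make the gap between the two perception metrics concrete, here is a minimal Python sketch of how a strict problem-level score and an average field-wise score can diverge on the same predictions. The field names (`left_card`, `stage`, `robot`), the toy data, and the exact averaging scheme are assumptions for illustration; the benchmark's real state schema and scoring code are not shown on this page.

```python
from typing import Dict, List

# Hypothetical game-state representation: field name -> value.
State = Dict[str, str]


def strict_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of problems where every gold field is predicted correctly."""
    exact = sum(
        1
        for p, g in zip(preds, golds)
        if all(p.get(field) == value for field, value in g.items())
    )
    return exact / len(golds)


def field_wise_accuracy(preds: List[State], golds: List[State]) -> float:
    """Per-field accuracy over problems, averaged across fields."""
    fields = sorted(golds[0])  # assumes every problem shares the same fields
    per_field = [
        sum(1 for p, g in zip(preds, golds) if p.get(f) == g[f]) / len(golds)
        for f in fields
    ]
    return sum(per_field) / len(per_field)


# Toy data (invented): three problems, three fields each.
golds = [
    {"left_card": "unknown", "stage": "flop", "robot": "idle"},
    {"left_card": "Ah", "stage": "turn", "robot": "idle"},
    {"left_card": "Kd", "stage": "river", "robot": "busy"},
]
preds = [
    {"left_card": "unknown", "stage": "flop", "robot": "idle"},  # all correct
    {"left_card": "Ad", "stage": "turn", "robot": "idle"},       # one field wrong
    {"left_card": "Kd", "stage": "turn", "robot": "busy"},       # one field wrong
]

strict = strict_accuracy(preds, golds)
field_wise = field_wise_accuracy(preds, golds)
print(f"strict={strict:.3f} field-wise={field_wise:.3f}")
```

In this toy case a single wrong field per problem is enough to fail the strict check, so the strict score (1/3) falls well below the field-wise score (7/9) even though most individual fields are right.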

What carries the argument

The DexHoldem benchmark, which supplies demonstrations for 14 Texas Hold'em primitives, runs standardized policy and agentic-perception evaluations on physical hardware, and closes the loop with waiting, recovery, and help-request behaviors.
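
The routing step that closes this loop can be pictured with a small Python sketch. It follows the decision step described in the Figure 2 caption (perceive, check the state, dispatch a primitive only when the scene is stable and the robot is idle), but the field names, the order of the checks, and every returned label other than `view_card(L)` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TableState:
    """Hypothetical slice of the structured game-state memory."""
    perception_ok: bool      # assumed flag: the parsed state is usable
    scene_stable: bool       # nothing on the table is still moving
    robot_idle: bool         # no primitive is currently executing
    needs_recovery: bool     # assumed flag: a prior primitive left a mess
    left_card_known: bool    # the left hole card has been read


def route(state: TableState) -> str:
    """Return the next dispatch decision for one loop iteration."""
    if not state.perception_ok:
        return "request_human_help"   # hypothetical label
    if not state.scene_stable or not state.robot_idle:
        return "wait"                 # hypothetical label
    if state.needs_recovery:
        return "recover"              # hypothetical label
    if not state.left_card_known:
        return "view_card(L)"         # primitive named in the Figure 2 caption
    return "wait"                     # nothing executable in this sketch


# The illustrated case: unknown left card, robot idle, stable scene.
decision = route(
    TableState(
        perception_ok=True,
        scene_stable=True,
        robot_idle=True,
        needs_recovery=False,
        left_card_known=False,
    )
)
print(decision)
```

Putting the waiting and help-request checks before any primitive dispatch is a design choice of this sketch: it keeps the hand from acting on a state the agent cannot trust.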

If this is right

The best evaluated policy reaches 61.2 percent task completion and 47.5 percent scene-preserving success on the 14 primitives.
Perception shows a large gap between the best field-wise accuracy (66.8 percent) and the best strict game-state accuracy (34.3 percent) required for decision routing; the two best scores come from different models.
Closed-loop deployments reveal compounding errors across perception, policy, and repeated primitive executions.
The benchmark explicitly supports testing of recovery dispatches, human-help requests, and scene-maintenance behaviors.
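
As a rough illustration of how per-primitive errors can compound over repeated executions, the Python sketch below multiplies a single-step success rate over a chain of steps. The independence of steps and the use of the 61.2 percent completion rate as a uniform per-step probability are simplifying assumptions made here; the paper's case studies are not summarized by this formula.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that n independent steps all succeed (assumed model)."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be a probability in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


# Best reported task completion rate, treated as a uniform per-step rate.
P_STEP = 0.612

for n in (1, 3, 5):
    print(n, round(chain_success(P_STEP, n), 3))
```

Under these assumptions, three consecutive primitives succeed about 23 percent of the time and five about 9 percent, which is one way to see why recovery dispatches and help requests matter in closed-loop runs.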

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structured game-state recovery tasks could be applied to other multi-step tabletop activities to test generality beyond poker.
Directly feeding perception outputs into policy inputs might shrink the observed error accumulation across full loops.
Repeating the evaluations on different dexterous hardware would show whether the performance numbers are specific to the ShadowHand.
Extending runs to complete multi-hand games would expose whether the current primitives scale to longer sequences.

Load-bearing premise

The 14 chosen Texas Hold'em manipulation primitives and the defined agentic perception tasks are representative of the core challenges faced by embodied agents in dynamic, multi-step physical scenes.

What would settle it

Running a new policy or perception model on the identical physical DexHoldem setup would settle it: exceeding 61.2 percent task completion or 34.3 percent strict accuracy would show that the reported numbers are current limits rather than fundamental ceilings.

Figures

Figures reproduced from arXiv: 2605.18727 by Feng Chen, Li Sun, Pei Zhou, Shenghua Gao, Tianzhe Chu, Yanchao Yang, Yi Ma, Yuexiang Zhai, Zhuxiu Xu.

**Figure 1.** Overview of DexHoldem, a real-world Texas Hold’em benchmark for dexterous manipulation. (a) The setup uses a ShadowHand with top-down, third-person, and wrist-mounted cameras for card and chip manipulation. (b) The system closes the loop by parsing observations into game state, routing instructions, and executing policies. (c,d) Policy and agent benchmarks show that current models still struggle with cont…

**Figure 2.** One decision step of the DexHoldem embodied agent. The agent perceives the tabletop, loads and renews structured game-state memory, routes the state through reasoning checks, and dispatches a dexterous policy when the scene is stable and an executable primitive is needed. In the illustrated step, an unknown left card with the robot idle routes to the agent primitive view_card(L), which translates to the de…

**Figure 3.** Final validation loss for the RDT fine-tuning data-scaling probe. Random and pretrained initializations follow similar data-scaling trends. Error bars denote one standard deviation over three completed paired seeds. We use RDT as a representative policy instantiation to probe how much DexHoldem-specific dexterous-hand data is needed to reliably fit the target action distribution, using held-out action-pr…

**Figure 4.** Some examples from the DexHoldem policy benchmark. .npy arrays. The loader converts dict-valued joint records into the canonical 30-dimensional joint order, constructs 100 training trajectories and 5 validation trajectories per primitive, and exposes a common batch containing RGB/depth observations, optional precomputed RGB features, normalized joint-position proprioception, the instruction ID, and a 30-di…

**Figure 5.** Policy pretraining data scale, policy size, and physical task completion rate on DexHoldem.

**Figure 6.** Train-time validation-loss curves for the representative RDT fine-tuning study across…

**Figure 7.** Simulation check through real-to-sim trajectory replay under the three camera views used…

**Figure 8.** Representative top-down replay sequence in the reconstructed simulation environment.
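
The Figure 4 caption mentions a loader that converts dict-valued joint records into a canonical 30-dimensional joint order. A minimal Python sketch of that one step follows; the joint names (`joint_00` to `joint_29`), the function name, and the error handling are hypothetical, and the real loader's normalization and batching are not reproduced here.

```python
from typing import Dict, List, Sequence

# Hypothetical canonical ordering; the dataset's real joint names are not shown.
CANONICAL_JOINT_ORDER: List[str] = [f"joint_{i:02d}" for i in range(30)]


def to_canonical_vector(
    record: Dict[str, float],
    order: Sequence[str] = CANONICAL_JOINT_ORDER,
) -> List[float]:
    """Turn a name -> position mapping into a fixed-order list of floats."""
    missing = [name for name in order if name not in record]
    if missing:
        raise KeyError(f"missing joints: {missing}")
    # Extra keys in the record are ignored; only canonical joints are kept.
    return [float(record[name]) for name in order]


# Toy record built in reverse order to show that key order does not matter.
record = {
    name: 0.01 * i for i, name in enumerate(reversed(CANONICAL_JOINT_ORDER))
}
vector = to_canonical_vector(record)
print(len(vector), vector[0], vector[-1])
```

Fixing the order once, in a single constant, means every downstream consumer of the 30-dimensional vector agrees on which index corresponds to which joint.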

Original abstract

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, $\pi_{0.5}$ obtains the highest task completion rate ($61.2\%$), while $\pi_{0.5}$ and $\pi_0$ tie on scene-preserving success rate ($47.5\%$). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy ($34.3\%$), while GPT 5.5 obtains the best average field-wise accuracy ($66.8\%$), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DexHoldem gives a physical benchmark for dexterous poker play but its primitives may not represent core embodied challenges well.

DexHoldem gives us a physical benchmark for dexterous manipulation in Texas Hold'em on a ShadowHand, with actual hardware results on policy and perception. That's the main thing to know. The paper does a solid job combining teleoperated demonstrations, a standardized policy benchmark, and an agentic perception test in one setup. They have 1,470 demos across 14 primitives and report specific outcomes: 61.2% task completion for the best policy and 34.3% strict accuracy for the best perception model. The case studies on closed-loop execution show how errors from perception and policy add up in practice, which is useful for understanding integrated systems.

This is new as a unified physical test for full embodied loops in a tabletop game. It earns credit for using real hardware and making the project page available for others to build on. The focus on recovering structured game state for decision making is a good angle.

The soft spot is the representativeness of the 14 primitives. The abstract doesn't provide justification, ablations, or comparisons to other dexterous tasks, so the stress-test concern holds up. Without that, it's hard to know if this benchmark captures broader challenges in dynamic physical scenes or stays too specific to poker handling.

This paper is for researchers in robotics and embodied AI who are interested in benchmarks for dexterous hands and multi-step tasks. Readers who value concrete physical evaluations over simulation will find it worthwhile. It deserves a serious referee because the hardware results and benchmark construction are substantial enough to review, even with the scope questions. I would recommend sending it for peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces DexHoldem, a real-world system-level benchmark for dexterous embodied agents centered on Texas Hold'em card manipulation using a ShadowHand. It contributes 1,470 teleoperated demonstrations across 14 manipulation primitives, a standardized physical policy benchmark reporting task completion and scene-preservation metrics, an agentic perception benchmark measuring recovery of structured game state, and three closed-loop case studies illustrating error accumulation in perception-policy loops.

Significance. If the benchmark's scope is validated, DexHoldem offers a concrete, hardware-grounded testbed that integrates perception, decision routing, and dexterous execution in a shared physical setting, exposing gaps between isolated visual capabilities and complete state recovery needed for embodied decisions. The physical experiments, teleoperated demonstration collection, and explicit case studies on waiting/recovery/human-help behaviors provide reproducible empirical grounding that strengthens claims about real-world deployment challenges.

major comments (1)

§3 (Benchmark Design): The 14 Texas Hold'em manipulation primitives are introduced as representative of core dexterous tabletop challenges without ablations against alternative tasks, explicit transfer experiments, or quantitative justification that success on these primitives predicts performance on broader multi-step dynamic scenes; this modeling choice is load-bearing for the central claim that DexHoldem evaluates dexterous execution and embodied decision routing in representative physical settings.

minor comments (2)

Figure 4 and §5.2: The perception accuracy tables would benefit from clearer error bars or per-run variance to allow readers to assess stability of the reported 34.3% strict accuracy and 66.8% field-wise accuracy.
§6 (Case Studies): The three closed-loop examples are described qualitatively; adding quantitative metrics on error propagation rates across the full loop would strengthen the illustration of how perception and policy errors accumulate.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential of DexHoldem as a hardware-grounded benchmark integrating perception, policy, and dexterous execution. We address the single major comment on benchmark design below and have revised the manuscript accordingly.

Referee: §3 (Benchmark Design): The 14 Texas Hold'em manipulation primitives are introduced as representative of core dexterous tabletop challenges without ablations against alternative tasks, explicit transfer experiments, or quantitative justification that success on these primitives predicts performance on broader multi-step dynamic scenes; this modeling choice is load-bearing for the central claim that DexHoldem evaluates dexterous execution and embodied decision routing in representative physical settings.

Authors: We appreciate the referee's point that the selection of the 14 primitives is central to our claims. These primitives were systematically derived from the rules and typical flow of Texas Hold'em, covering the full range of required physical interactions: deck dealing, card flipping and revealing, chip pushing and stacking, and community-card organization. The set was refined through consultation with professional dealers and prior dexterous-manipulation literature to ensure coverage of contact-rich, precision, and in-hand skills that appear in real gameplay. We agree that ablations against alternative task sets or explicit transfer studies would further strengthen generalizability arguments; however, the primary goal of this work is to release a reproducible, domain-specific benchmark rather than to optimize or validate a universal task taxonomy. In the revised manuscript we have added a dedicated paragraph in §3 that (i) lists the explicit mapping from each primitive to core dexterous challenges, (ii) provides frequency estimates drawn from recorded poker sessions, and (iii) references established manipulation taxonomies to supply the requested quantitative grounding. This addition directly supports the claim that success on these primitives is indicative of performance in the broader multi-step physical scenes that constitute the benchmark. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain

The paper introduces DexHoldem as a physical benchmark consisting of teleoperated demonstrations, policy evaluations on hardware, and perception model tests. All reported numbers (61.2% task completion, 34.3% strict accuracy, etc.) are direct empirical measurements from runs on the defined primitives and tasks. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs; the 14 primitives and game-state recovery tasks are explicitly chosen modeling decisions rather than derived results. No self-citation load-bearing steps or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The benchmark rests on standard robotics domain assumptions about teleoperation data and hardware fidelity rather than new physical postulates or many fitted parameters.

free parameters (1)

Selection of 14 primitives
The specific set of 14 Texas Hold'em actions is chosen to cover the game but constitutes an ad-hoc modeling decision.

axioms (1)

Domain assumption: Teleoperated human demonstrations provide a suitable reference distribution for evaluating robot policy performance on the same primitives.
Invoked when using the 1,470 demonstrations as the basis for the policy benchmark.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: On primitive execution, π0.5 obtains the highest task completion rate (61.2%), while π0.5 and π0 tie on scene-preserving success rate (47.5%).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 20 internal anchors

[1]

Robel: Robotics benchmarks for learning with low-cost robots

Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine, and Vikash Kumar. Robel: Robotics benchmarks for learning with low-cost robots. InConfer- ence on robot learning, pp. 1300–1313. PMLR, 2020

work page 2020
[2]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob Mc- Grew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

work page 2020
[4]

Dexart: Benchmarking generalizable dexterous manipulation with articulated objects, 2023

Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects, 2023. URL https://arxiv.org/abs/2305. 05706

work page 2023
[5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. URL https: //arxiv.org/abs/2410.24164

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Rt-1: Robotics transformer for real-world control at scale, 2023

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, ...

work page 2023
[7]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[8]

Solving challenging dexterous manipulation tasks with trajectory optimisation and reinforcement learning, 2021

Henry Charlesworth and Giovanni Montana. Solving challenging dexterous manipulation tasks with trajectory optimisation and reinforcement learning, 2021. URL https://arxiv.org/ abs/2009.05104

work page arXiv 2021
[9]

Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization.arXiv preprint arXiv:2412.16490, 2024

Jiayi Chen, Yubin Ke, and He Wang. Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization.arXiv preprint arXiv:2412.16490, 2024

work page arXiv 2024
[10]

Visual dexterity: In-hand reorientation of novel and complex object shapes.Science Robotics, 8 (84):eadc9244, 2023

Tao Chen, Megha Tippur, Siyang Wu, Vikash Kumar, Edward Adelson, and Pulkit Agrawal. Visual dexterity: In-hand reorientation of novel and complex object shapes.Science Robotics, 8 (84):eadc9244, 2023

work page 2023
[11]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalable d...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

A Simple Framework for Contrastive Learning of Visual Representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020. URL https://arxiv.org/abs/ 2002.05709. 10

work page internal anchor Pith review Pith/arXiv arXiv 2020
[13]

Towards human-level bimanual dexterous manipulation with reinforcement learning

Yuanpei Chen, Yaodong Yang, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Zongqing Lu, Stephen Marcus McAleer, Hao Dong, and Song-Chun Zhu. Towards human-level bimanual dexterous manipulation with reinforcement learning. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview...

work page 2022
[14]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems (RSS), 2023

work page 2023
[15]

Benchmarking in-hand manipulation.IEEE Robotics and Automation Letters, 5(2):588–595, April 2020

Silvia Cruciani, Balakumar Sundaralingam, Kaiyu Hang, Vikash Kumar, Tucker Hermans, and Danica Kragic. Benchmarking in-hand manipulation.IEEE Robotics and Automation Letters, 5(2):588–595, April 2020. ISSN 2377-3774. doi: 10.1109/lra.2020.2964160. URL http://dx.doi.org/10.1109/LRA.2020.2964160

work page doi:10.1109/lra.2020.2964160 2020
[16]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

work page 2023
[17]

D4rl: Datasets for deep data-driven reinforcement learning, 2021

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning, 2021. URL https://arxiv.org/abs/2004. 07219

work page 2021
[18]

Maniskill2: A unified benchmark for generalizable manipulation skills

Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. Maniskill2: A unified benchmark for generalizable manipulation skills, 2023. URL https://arxiv.org/abs/2302.04659

work page arXiv 2023
[19]

Baku: An efficient transformer for multi-task policy learning, 2024

Siddhant Haldar, Zhuoran Peng, and Lerrel Pinto. Baku: An efficient transformer for multi-task policy learning, 2024. URLhttps://arxiv.org/abs/2406.07539

work page arXiv 2024
[20]

Scaling Laws for Transfer

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer, 2021. URLhttps://arxiv.org/abs/2102.01293

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Inner monologue: Embodied reasoning through planning with language models, 2022

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022

work page 2022
[22]

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. Rlbench: The robot learning benchmark & learning environment, 2019. URL https://arxiv.org/abs/1909. 12271

work page 2019
[25]

Vima: General robot manipulation with multimodal prompts

Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts. InFortieth International Conference on Machine Learning, 2023. 11

work page 2023
[26]

Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 16923–16930. IEEE, 2025

work page 2025
[27]

Momagraph: State-aware unified scene graphs with vision-language model for embodied task planning.International Conference on Learning Representations (ICLR) Oral, 2026

Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, and Koushil Sreenath. Momagraph: State-aware unified scene graphs with vision-language model for embodied task planning.International Conference on Learning Representations (ICLR) Oral, 2026

work page 2026
[28]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Big Transfer (BiT): General Visual Repre- sentation Learning, 2020

Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning, 2020. URL https://arxiv.org/abs/1912.11370

work page arXiv 2020
[30]

AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

Robohive – a unified framework for robot learning

Vikash Kumar, Rutav Shah, Gaoyue Zhou, Vincent Moens, Vittorio Caggiano, Jay Vakil, Abhishek Gupta, and Aravind Rajeswaran. Robohive – a unified framework for robot learning. InNeurIPS: Conference on Neural Information Processing Systems, 2023. URL https: //sites.google.com/view/robohive

work page 2023
[32]

Code as Policies: Language Model Programs for Embodied Control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control, 2023. URL https://arxiv.org/abs/2209.07753

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https: //arxiv.org/abs/2306.03310

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation, 2025. URLhttps://arxiv.org/abs/2410.07864

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Realdex: Towards human-like grasping for robotic dexterous hand,

Yumeng Liu, Yaxun Yang, Youzhuo Wang, Xiaofei Wu, Jiamin Wang, Yichen Yao, Sören Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu, and Yuexin Ma. Realdex: Towards human-like grasping for robotic dexterous hand.arXiv preprint arXiv:2402.13853, 2024. URL https://arxiv.org/abs/2402.13853

work page arXiv 2024
[36]

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations.arXiv preprint arXiv:2310.17596, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks, 2022

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks, 2022. URL https://arxiv.org/abs/2112.03227

work page arXiv 2022
[38] Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations, 2021

[39] Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins (early version), 2025. URL https://arxiv.org/abs/2409.02920

[40] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots, 2024. URL https://arxiv.org/abs/2406.02523

[41] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science a...

[42] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

[43] Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8494–8502, 2018

[44] Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Hao Su, and Xiaolong Wang. Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation. In Conference on Robot Learning, pp. 594–605. PMLR, 2023

[45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

[46] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017

[47] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9339–9347, 2019

[48] Shadow Robot Company. Shadow Dexterous Hand - Technical Specification. Shadow Robot Company, 2025. URL https://shadowrobot.com/wp-content/uploads/2025/09/shadow_dexterous_hand_e_technical_specification.pdf

[49] Shadow Robot Company. Shadow teleoperation system: Technical specification, September 2025. URL https://shadowrobot.com/wp-content/uploads/2025/09/shadow_teleop_technical_specification.pdf. Technical specification

[50] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10740–10749, 2020

[51] Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, C. Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments, 2021

[52] Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse kai Chan, Yuan Gao, Xuanlin Li, Tongzhou Mu, Nan Xiao, Arnav Gurha, Viswesh Nagaswamy Rajesh, Yong Woo Choi, Yen-Ru Chen, Zhiao Huang, Roberto Calandra, Rui Chen, Shan Luo, and Hao Su. Maniskill3: Gpu parallelized robotics simulation and r...

[53] Dylan Turpin, Tao Zhong, Shutong Zhang, Guanglei Zhu, Jingzhou Liu, Ritvik Singh, Eric Heiden, Miles Macklin, Stavros Tsogkas, Sven Dickinson, and Animesh Garg. Fast-grasp’d: Dexterous multi-finger grasp generation through differentiable simulation, 2023. URL https://arxiv.org/abs/2306.08132

[54] Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Fei-Fei Li, and Manling Li. Enact: Evaluating embodied cognition with world modeling of egocentric interaction. arXiv preprint arXiv:2511.20937, 2025

[55] Youzhuo Wang, Jiayi Ye, Chuyang Xiao, Yiming Zhong, Heng Tao, Hang Yu, Yumeng Liu, Jingyi Yu, and Yuexin Ma. Dexh2r: A benchmark for dynamic dexterous grasping in human-to-robot handover, 2025. URL https://arxiv.org/abs/2506.23152

[56] Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, et al. Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv preprint arXiv:2502.09560, 2025

[57] Jianglong Ye, Keyi Wang, Chengjing Yuan, Ruihan Yang, Yiquan Li, Jiyue Zhu, Yuzhe Qin, Xueyan Zou, and Xiaolong Wang. Dex1b: Learning with 1b demonstrations for dexterous manipulation, 2025. URL https://arxiv.org/abs/2506.17198

[58] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pp. 1094–1100. PMLR, 2020

[59] Jialiang Zhang, Haoran Liu, Danshi Li, Xinqiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes, 2024. URL https://arxiv.org/abs/2410.23004

[60] Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks, 2024. URL https://arxiv.org/abs/2412.18194

[61] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

[62] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pp. 2165–2183. PMLR, 2023.

A Author Contributions

Feng Co-proposed and led the project; designed the data-collec...
