pith. sign in

arxiv: 2605.18722 · v1 · pith:DVX6BQIHnew · submitted 2026-05-18 · 💻 cs.RO

Dexora: Open-source VLA for High-DoF Bimanual Dexterity

Pith reviewed 2026-05-20 09:39 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-actionbimanual dexterityrobotic manipulationdexterous handsteleoperationdiffusion policydual-arm controlopen-source VLA
0
0 comments X

The pith

Dexora shows that a quality-weighted VLA trained on matched synthetic and real teleoperation data can learn effective high-DoF bimanual dexterity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Dexora, the first open-source vision-language-action model built for dual-arm dual-hand high-DoF manipulation. It collects large embodiment-matched datasets through a hybrid teleoperation pipeline that separates arm kinematics from finger motion and drives both a physical robot and its MuJoCo twin. Training uses an offline discriminator to assign clip-level weights that down-weight noisy demonstrations inside a diffusion-transformer policy. This produces 90 percent success on basic tasks and 66.7 percent average success on dexterous benchmarks, beating prior VLA baselines while also generalizing to new distributions and different embodiments. A sympathetic reader would care because most existing VLAs remain limited to low-DoF grippers or single arms, and high-DoF bimanual control is a necessary step toward robots that can perform complex, human-scale tasks.

Core claim

By combining a hybrid exoskeleton-and-vision teleoperation interface with a 100K-trajectory synthetic corpus and 10K real episodes, then training a diffusion-transformer policy under clip-level weights from an offline discriminator, Dexora creates an effective open-source VLA for high-DoF bimanual dexterity that outperforms competitive baselines on both basic and dexterous benchmarks, reaches 90 percent success on basic tasks, and exhibits robust out-of-distribution and cross-embodiment generalization.

What carries the argument

The data-quality-aware training recipe that uses an offline discriminator to supply clip-level weights for down-weighting low-quality teleoperation demonstrations inside diffusion-transformer policy training on combined synthetic and real data.

If this is right

  • Reaches 90 percent success on basic manipulation tasks.
  • Achieves 66.7 percent average success on dexterous benchmarks versus 51.7 percent for prior VLA baselines.
  • Demonstrates robust generalization to out-of-distribution inputs and to different robot embodiments.
  • Ablations confirm that both the inclusion of real data and the discriminator weighting are necessary for high dexterity performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open-sourcing the full system and datasets could let other groups extend the approach to additional high-DoF platforms without starting from scratch.
  • The hybrid teleoperation pipeline could be reused to collect higher-quality data for tasks requiring even finer finger coordination than the current benchmarks.
  • Further increases in the volume of real-world episodes might produce additional gains in generalization beyond the out-of-distribution cases already tested.

Load-bearing premise

The offline discriminator can reliably assign clip-level weights that reduce the effect of noisy teleoperation demonstrations enough for the policy to learn high-DoF bimanual control from the mixed datasets.

What would settle it

Training the same diffusion-transformer policy on the combined datasets without the discriminator weights and measuring whether average dexterous success falls to or below the reported 51.7 percent baseline level.

Figures

Figures reproduced from arXiv: 2605.18722 by Dongyun Ge, Guocai Yao, Guoxuan Chi, Hang Zhao, Hao Dong, Hao Zhao, Hongyang Li, Huan-ang Gao, Huazhe Xu, Jianyu Chen, Jiayuan Gu, Jinbang Guo, Jingrui Pang, Kun Li, Li Yi, Minwen Liao, Modi Shi, Pengwei Wang, Rui Chen, Saining Zhang, Shanghang Zhang, Yao Mu, Yixin Zhu, Zhuo Yang, Zongzheng Zhang.

Figure 1
Figure 1. Figure 1: Dexora overview. (a) Motivation: Three illustrative contrasts highlight the need for dual-arm, dual-hand dexterous VLA: piston insertion (requires two arms), book retrieval from a packed shelf (hands with fingers succeed where grippers fail), and bottle opening (12-DoF fingers with lateral swing outperform 6-DoF). (b) Dataset (§III-B): We pretrain on 100K simulated bimanual-hand trajectories and post-train… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of embodiment coverage. Prior works cover either single-arm or low-DoF dual-arm settings. Dexora is the first system posi￾tioned in the dual-arm, high-DoF dexterous region, while also generalizing across simpler embodiments without re-architecture. task diversity, while real data provides fine-grained realism essential for high-DoF bimanual dexterity. Together, this dataset establishes a foundat… view at source ↗
Figure 3
Figure 3. Figure 3: Hardware and teleoperation system. (a) Hybrid teleoperation interface and 12-DoF XHAND. (b)-(c) The operator teleoperates the physical robot and its MujoCo digital twin, so apple→plate demonstrations are collected in real and simulation under the same interface, thereby reducing the sim-to-real gap. III. DEXORA In this section, we first introduce the hardware setup and teleoperation system (Sec. III-A), fo… view at source ↗
Figure 4
Figure 4. Figure 4: Dataset demonstration. (a) Simulation objects subset: our simulator includes 297 objects across 30 categories. (b) Real-world objects (347 objects, 17 categories), covering both basic and dexterous use cases. (c) Per-family task distribution in simulation vs. real. The simulation data only includes basic tasks, while the real-world set shifts weight toward dexterity (20%). (d) Trajectory counts per family … view at source ↗
Figure 5
Figure 5. Figure 5: Dexora framework. (a) Data filtering: From the real-world dataset we pre-screen demonstrations by kinematic smoothness (low acceleration and jerk), then replay them for post-validation and keep the clips that complete the task without collisions, forming a high-quality subset. (b) Discriminator training: With the pretrained diffusion–transformer policy frozen, we compute a log-π proxy for each clip and tra… view at source ↗
Figure 6
Figure 6. Figure 6: Basic tasks suite. (a) Pick and Place (5 tasks). (b) Assemble/Disassemble (5 tasks). (c) Articulated Objects (2 tasks). TABLE I BASIC TASKS EVALUATION. RESULTS ARE SUCCESS RATES (%) OVER 20 TRIALS. GRAY COLUMNS INDICATE BIMANUAL TASKS. Method Pick and Place Assemble / Disassemble Articulated Object Avg. Apple → plate Bowl → bowl Two eggs → box Lift basket Left block → right plate Stack ring blocks Grab squ… view at source ↗
Figure 7
Figure 7. Figure 7: Dexterous manipulation sequences. (a) Use Pen: The left hand picks up the pen (#1), hands it to the right hand (#2); the right thumb depresses the tip (#3) and writes on paper (#4). (b) Cut Leek: The right hand grasps the knife (#1), the left hand stabilizes the leek (#2); the right hand slices (#3) and returns the knife to the table (#4). (c) Rough Dough: Both hands press the rolling pin simultaneously (#… view at source ↗
Figure 9
Figure 9. Figure 9: Cross-embodiment generalization. The Dexora policy transfers to (a) single-arm gripper, (b) dual-arm grippers, and (c) single-arm single￾hand, completing representative tasks like a three-step pepper handover. apple plate stack ring blocks Use pen Cut leek 0 20 40 60 80 100 Success rate (%) 75 90 100 65 80 85 0 35 65 10 60 80 Sim Only Sim + 50% Real Sim + All Real [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗
read the original abstract

Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. Dexora presents the first open-source VLA model for high-DoF bimanual dexterity with dual arms and dual hands. It introduces a hybrid teleoperation pipeline (exoskeleton for gross arm motion, Apple Vision Pro for finger tracking) that drives both a physical platform and MuJoCo twin, assembles a 100K-trajectory synthetic corpus plus 10K real teleoperated episodes, and trains a diffusion-transformer policy with an offline discriminator that supplies clip-level weights to down-weight noisy demonstrations. The paper reports 90% success on basic tasks, 66.7% average dexterous success (vs. 51.7% for competitive VLA baselines), plus robust OOD and cross-embodiment generalization, with ablations attributing gains to real data and the discriminator.

Significance. If the empirical claims hold under rigorous controls, this would be a notable contribution as the first open-source VLA explicitly targeting high-dimensional dual-arm/dual-hand control. The hybrid data recipe and quality-aware weighting address a practical bottleneck in scaling dexterous policies; reproducible open-source release of model, data, and interface would further amplify impact on embodied AI research.

major comments (2)
  1. The central performance claim (66.7% dexterous success vs. 51.7% baselines, plus OOD/cross-embodiment gains) rests on the offline discriminator producing clip-level weights that meaningfully down-weight noisy teleoperation episodes. The abstract states that ablations confirm the discriminator's importance, yet the manuscript provides no direct validation of this mechanism: no weight histograms, no qualitative examples of high- versus low-weight clips, and no reported correlation between assigned weights and independent quality metrics such as trajectory smoothness or human-rated task completion. Without these checks it remains possible that reported gains derive from dataset scale, the hybrid interface, or model capacity rather than the weighting itself.
  2. Baseline comparisons and statistical reporting lack necessary detail for reproducibility and fairness. The abstract cites concrete success-rate deltas but supplies no information on how the competitive VLA baselines were implemented (e.g., exact architectures, training hyperparameters, or whether they received the same mixed synthetic+real corpus), no statistical tests or confidence intervals on the reported percentages, and no precise definitions of the basic versus dexterous task suites or success criteria.
minor comments (2)
  1. Notation for the discriminator output (clip-level weights) should be introduced with an explicit equation or pseudocode block so readers can trace how weights are applied inside the diffusion-transformer loss.
  2. Figure captions for the teleoperation interface and dataset examples would benefit from additional labels indicating which components are synthetic versus real and which arm/hand DoFs are being controlled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and outline the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: The central performance claim (66.7% dexterous success vs. 51.7% baselines, plus OOD/cross-embodiment gains) rests on the offline discriminator producing clip-level weights that meaningfully down-weight noisy teleoperation episodes. The abstract states that ablations confirm the discriminator's importance, yet the manuscript provides no direct validation of this mechanism: no weight histograms, no qualitative examples of high- versus low-weight clips, and no reported correlation between assigned weights and independent quality metrics such as trajectory smoothness or human-rated task completion. Without these checks it remains possible that reported gains derive from dataset scale, the hybrid interface, or model capacity rather than the weighting itself.

    Authors: We agree that direct evidence for the discriminator's weighting mechanism would strengthen the claims. Our existing ablations show performance drops when the weighting is removed, but we acknowledge the absence of supporting visualizations and correlations. In the revised manuscript we will add: (i) histograms of clip-level weights across the real dataset, (ii) qualitative examples of high- versus low-weight trajectories with corresponding smoothness metrics, and (iii) a correlation analysis between assigned weights and independent quality indicators (trajectory jerk and human-rated task completion on a held-out subset). These additions will help isolate the contribution of the weighting from dataset scale and model capacity. revision: yes

  2. Referee: Baseline comparisons and statistical reporting lack necessary detail for reproducibility and fairness. The abstract cites concrete success-rate deltas but supplies no information on how the competitive VLA baselines were implemented (e.g., exact architectures, training hyperparameters, or whether they received the same mixed synthetic+real corpus), no statistical tests or confidence intervals on the reported percentages, and no precise definitions of the basic versus dexterous task suites or success criteria.

    Authors: We accept that additional implementation and statistical details are required for reproducibility. In the revised version we will expand the experimental section to include: (1) exact architectures, training hyperparameters, and data corpus details for each baseline (confirming use of the same mixed synthetic+real data where applicable), (2) precise definitions of the basic and dexterous task suites together with success criteria, and (3) statistical significance tests with 95% confidence intervals computed over multiple random seeds for all reported success rates. These changes will ensure fair and reproducible comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on experimental training and evaluation, not on any derivation that reduces to inputs by construction.

full rationale

The paper describes an empirical pipeline: hybrid teleoperation data collection (synthetic 100K trajectories + real 10K episodes), an offline discriminator that assigns clip-level weights, and training of a diffusion-transformer policy whose success rates (66.7% dexterous, 90% basic) are measured on held-out benchmarks. No equations, uniqueness theorems, or first-principles derivations are invoked whose outputs are definitionally equivalent to the fitted weights or the training corpus itself. The discriminator weighting is a modeling choice whose effectiveness is asserted via ablations, but the reported numbers are direct experimental outcomes rather than predictions forced by the paper's own definitions or self-citations. The derivation chain is therefore self-contained as standard supervised learning on collected data.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of the hybrid teleoperation interface and the discriminator weighting scheme. No new physical entities are postulated. The main unstated premises are that the collected demonstrations contain sufficient signal for high-DoF learning once low-quality clips are down-weighted and that simulation-to-real transfer works for the reported tasks.

free parameters (1)
  • clip-level discriminator weights
    Learned or tuned weights that scale the contribution of each demonstration segment during diffusion policy training.
axioms (1)
  • domain assumption Hybrid exoskeleton-plus-markerless tracking produces demonstrations whose quality distribution can be meaningfully scored by an offline discriminator
    The data-quality-aware training recipe depends on this premise to mitigate noise in the 10K real episodes.

pith-pipeline@v0.9.0 · 5912 in / 1491 out tokens · 48263 ms · 2026-05-20T09:39:56.542734+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 10 internal anchors

  1. [1]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid,et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inCoRL, PMLR, 2023

  2. [2]

    OpenVLA: An Open-Source Vision-Language-Action Model

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Open- vla: An open-source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter,et al., “π 0: A vision- language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

  4. [4]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai,et al., “π 0.5: a vision-language-action model with open-world generalization,”arXiv preprint arXiv:2504.16054, 2025

  5. [5]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,”arXiv preprint arXiv:2410.07864, 2024

  6. [6]

    GR-3 Technical Report

    C. Cheang, S. Chen, Z. Cui, Y . Hu, L. Huang, T. Kong, H. Li, Y . Li, Y . Liu, X. Ma,et al., “Gr-3 technical report,”arXiv preprint arXiv:2507.15493, 2025

  7. [7]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang,et al., “Gr00t n1: An open foundation model for generalist humanoid robots,”arXiv preprint arXiv:2503.14734, 2025

  8. [8]

    Learning fine-grained bimanual manipulation with low-cost hardware,

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,”RSS, 2023

  9. [9]

    Aloha 2: An enhanced low-cost hardware for bimanual teleoperation,

    A. . Team, “Aloha 2: An enhanced low-cost hardware for bimanual teleoperation,”arXiv preprint arXiv:2405.02292, 2024

  10. [10]

    Open teach: A versatile teleoperation system for robotic manipulation,

    A. Iyer, Z. Peng, Y . Dai, I. Guzey, S. Haldar, S. Chintala, and L. Pinto, “Open teach: A versatile teleoperation system for robotic manipulation,”CoRL, 2024

  11. [11]

    Bunny-visionpro: Real-time bimanual dexterous teleoperation for imitation learning,

    R. Ding, Y . Qin, J. Zhu, C. Jia, S. Yang, R. Yang, X. Qi, and X. Wang, “Bunny-visionpro: Real-time bimanual dexterous teleoperation for imitation learning,”arXiv preprint arXiv:2407.03162, 2024

  12. [12]

    Anyteleop: A general vision-based dexterous robot arm- hand teleoperation system,

    Y . Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y .-W. Chao, and D. Fox, “Anyteleop: A general vision-based dexterous robot arm- hand teleoperation system,”RSS, 2023

  13. [13]

    Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network,

    S. Li, X. Ma, H. Liang, M. G ¨orner, P. Ruppel, B. Fang, F. Sun, and J. Zhang, “Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network,”ICRA, 2019

  14. [14]

    Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,

    H. Fang, H.-S. Fang, Y . Wang, J. Ren, J. Chen, R. Zhang, W. Wang, and C. Lu, “Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,”ICRA, 2024

  15. [15]

    Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation,

    M. Xu, H. Zhang, Y . Hou, Z. Xu, L. Fan, M. Veloso, and S. Song, “Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation,”CoRL, 2025

  16. [16]

    Spark-remote: A cost-effective sys- tem for remote bimanual robot teleoperation,

    A. Imdieke and K. Desingh, “Spark-remote: A cost-effective sys- tem for remote bimanual robot teleoperation,”arXiv preprint arXiv:2504.05488, 2025

  17. [17]

    Gello: A general, low- cost, and intuitive teleoperation framework for robot manipulators,

    P. Wu, Y . Shentu, Z. Yi, X. Lin, and P. Abbeel, “Gello: A general, low- cost, and intuitive teleoperation framework for robot manipulators,” IROS, 2024

  18. [18]

    How to train your robots? the impact of demonstration modality on imitation learning,

    H. Li, Y . Cui, and D. Sadigh, “How to train your robots? the impact of demonstration modality on imitation learning,”arXiv preprint arXiv:2503.07017, 2025

  19. [19]

    Language Conditioned Multi-Finger Dexterous Manipulation Enabled by Physical Compliance and Switching of Controllers

    C. Pan, K. Junge, and J. Hughes, “Vision-language-action model and diffusion policy switching enables dexterous control of an anthropo- morphic hand,”arXiv preprint arXiv:2410.14022, 2024

  20. [20]

    Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes,

    J. Zhang, H. Liu, D. Li, X. Yu, H. Geng, Y . Ding, J. Chen, and H. Wang, “Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes,” in8th CoRL, 2024

  21. [21]

    Dex1b: Learning with 1b demonstrations for dexterous manipulation,

    J. Ye, K. Wang, C. Yuan, R. Yang, Y . Li, J. Zhu, Y . Qin, X. Zou, and X. Wang, “Dex1b: Learning with 1b demonstrations for dexterous manipulation,”arXiv preprint arXiv:2506.17198, 2025

  22. [22]

    G-hop: Generative hand- object prior for interaction reconstruction and grasp synthesis,

    Y . Ye, A. Gupta, K. Kitani, and S. Tulsiani, “G-hop: Generative hand- object prior for interaction reconstruction and grasp synthesis,” in CVPR, pp. 1911–1920, 2024

  23. [23]

    Dexgrasp anything: Towards universal robotic dexterous grasping with physics awareness,

    Y . Zhong, Q. Jiang, J. Yu, and Y . Ma, “Dexgrasp anything: Towards universal robotic dexterous grasping with physics awareness,” in CVPR, pp. 22584–22594, 2025

  24. [24]

    Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy,

    Y . Xu, W. Wan, J. Zhang, H. Liu, Z. Shan, H. Shen, R. Wang, H. Geng, Y . Weng, J. Chen,et al., “Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy,” inCVPR, pp. 4737–4746, 2023

  25. [25]

    Semgrasp: Semantic grasp generation via language aligned discretization,

    K. Li, J. Wang, L. Yang, C. Lu, and B. Dai, “Semgrasp: Semantic grasp generation via language aligned discretization,” inECCV, 2024

  26. [26]

    Realdex: Towards human-like grasping for robotic dexterous hand.arXiv preprint arXiv:2402.13853, 2024

    Y . Liu, Y . Yang, Y . Wang, X. Wu, J. Wang, Y . Yao, S. Schwertfeger, S. Yang, W. Wang, J. Yu,et al., “Realdex: Towards human-like grasping for robotic dexterous hand,”arXiv:2402.13853, 2024

  27. [27]

    Dexonomy: Synthesiz- ing all dexterous grasp types in a grasp taxonomy,

    J. Chen, Y . Ke, L. Peng, and H. Wang, “Dexonomy: Synthesiz- ing all dexterous grasp types in a grasp taxonomy,”arXiv preprint arXiv:2504.18829, 2025

  28. [28]

    Robustdex- grasp: Robust dexterous grasping of general objects,

    H. Zhang, Z. Wu, L. Huang, S. Christen, and J. Song, “Robustdex- grasp: Robust dexterous grasping of general objects,”arXiv preprint arXiv:2504.05287, 2025

  29. [29]

    Maniptrans: Efficient dexterous bimanual manipulation transfer via residual learning,

    K. Li, P. Li, T. Liu, Y . Li, and S. Huang, “Maniptrans: Efficient dexterous bimanual manipulation transfer via residual learning,” in CVPR, pp. 6991–7003, 2025

  30. [30]

    Dexmimicgen: Automated data generation for biman- ual dexterous manipulation via imitation learning,

    Z. Jiang, Y . Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. Fan, and Y . Zhu, “Dexmimicgen: Automated data generation for biman- ual dexterous manipulation via imitation learning,”arXiv preprint arXiv:2410.24185, 2024

  31. [31]

    Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

    R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, H. Yin, S. Liu,et al., “Egovla: Learning vision-language-action models from egocentric human videos,”arXiv:2507.12440, 2025

  32. [32]

    Ta- vla: Elucidating the design space of torque-aware vision- language-action models.arXiv preprint arXiv:2509.07962, 2025

    Z. Zhang, H. Xu, Z. Yang, C. Yue, Z. Lin, H.-a. Gao, Z. Wang, and H. Zhao, “Ta-vla: Elucidating the design space of torque-aware vision- language-action models,”arXiv preprint arXiv:2509.07962, 2025

  33. [33]

    Robochemist: Long-horizon and safety-compliant robotic chemical experimentation,

    Z. Zhang, C. Yue, H. Xu, M. Liao, X. Qi, H.-a. Gao, Z. Wang, and H. Zhao, “Robochemist: Long-horizon and safety-compliant robotic chemical experimentation,”arXiv preprint arXiv:2509.08820, 2025

  34. [34]

    GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

    S. Deng, M. Yan, S. Wei, H. Ma, Y . Yang, J. Chen, Z. Zhang, T. Yang, X. Zhang, H. Cui,et al., “Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data,”arXiv preprint arXiv:2505.03233, 2025

  35. [35]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang,et al., “Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,”arXiv preprint arXiv:2503.06669, 2025

  36. [36]

    Dexgraspvla: A vision-language-action framework towards general dexterous grasping,

    Y . Zhong, X. Huang, R. Li, C. Zhang, Y . Liang, Y . Yang, and Y . Chen, “Dexgraspvla: A vision-language-action framework towards general dexterous grasping,”arXiv preprint arXiv:2502.20900, 2025

  37. [37]

    Being-h0: Vision-language-action pretraining from large-scale human videos,

    H. Luo, Y . Feng, W. Zhang, S. Zheng, Y . Wang, H. Yuan, J. Liu, C. Xu, Q. Jin, and Z. Lu, “Being-h0: Vision-language-action pretraining from large-scale human videos,”arXiv preprint arXiv:2507.15597, 2025

  38. [38]

    Dreamgen: Unlocking generalization in robot learning through neural trajectories,

    J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Lin,et al., “Dreamgen: Unlocking generalization in robot learning through neural trajectories,” pp. arXiv–2505, 2025

  39. [39]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang,et al., “Qwen2. 5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  40. [40]

    Objaverse-xl: A universe of 10m+ 3d objects,

    M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . V oleti, S. Y . Gadre,et al., “Objaverse-xl: A universe of 10m+ 3d objects,”NeurIPS, vol. 36, 2023

  41. [41]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”JMLR, vol. 21, pp. 1–67, 2020

  42. [42]

    Sigmoid loss for language image pre-training,

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 11975–11986, 2023

  43. [43]

    Discriminator-weighted offline imitation learning from suboptimal demonstrations,

    H. Xu, X. Zhan, H. Yin, and H. Qin, “Discriminator-weighted offline imitation learning from suboptimal demonstrations,” inInternational Conference on Machine Learning, pp. 24725–24742, PMLR, 2022

  44. [44]

    Diffusion policy: Visuomotor policy learning via action diffusion,

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”IJRR, 2023