Recognition: no theorem link
Genie Sim 3.0: A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot
Pith reviewed 2026-05-16 17:58 UTC · model grok-4.3
The pith
Genie Sim 3.0 shows synthetic data from LLM-generated scenes can train humanoid robot policies that transfer zero-shot to the real world.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the open-source dataset generated by Genie Sim 3.0 supports robust zero-shot sim-to-real transfer for humanoid robot policies, establishing that synthetic data can serve as an effective substitute for real-world data under controlled conditions for scalable policy training.
What carries the argument
Genie Sim Generator, an LLM-powered tool that constructs high-fidelity scenes from natural language instructions to enable rapid multi-dimensional generalization and large-scale data synthesis.
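To make that concrete, here is a minimal, hypothetical sketch of what instruction-to-scene synthesis could look like: an LLM is prompted to emit a structured scene specification that a simulator then loads. Everything here is an illustrative assumption rather than Genie Sim's actual interface; the `call_llm` stub, the JSON schema, and the validation step are invented for the sketch.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call, stubbed with a canned response so the
    sketch runs offline; a real system would query an actual model."""
    return json.dumps({
        "scene": "kitchen_counter",
        "objects": [
            {"asset": "mug_03", "pose_xyz": [0.42, -0.10, 0.91]},
            {"asset": "plate_12", "pose_xyz": [0.55, 0.08, 0.90]},
        ],
        "lighting": {"intensity_lux": 450, "temperature_k": 4500},
        "task": "place the mug on the plate",
    })

def generate_scene_spec(instruction: str) -> dict:
    """Ask the LLM for a structured scene spec and validate required
    fields before handing it to a simulator."""
    raw = call_llm(
        "Emit a JSON scene specification for a robot manipulation simulator.\n"
        f"Instruction: {instruction}"
    )
    spec = json.loads(raw)
    for key in ("scene", "objects", "task"):
        if key not in spec:
            raise ValueError(f"scene spec missing required field: {key}")
    return spec

spec = generate_scene_spec("set a mug on a plate on the kitchen counter")
print(spec["task"], "|", len(spec["objects"]), "objects")
```

The structured intermediate is the point: one natural-language instruction can fan out into many machine-checkable, loadable environments, which is what makes large-scale synthesis tractable.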
Load-bearing premise
The generated scenes must achieve sufficient physical and visual fidelity that policies trained on them perform comparably in the real world, and the automated VLM evaluation must accurately predict that real-world performance.
What would settle it
Train a policy exclusively on the released synthetic dataset for one of the 200 tasks and measure whether its success rate on the matching real-world task falls substantially below the reported sim performance.
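As a sketch of what that settling experiment would report, the comparison reduces to two Bernoulli success-rate estimates and their gap. The counts below are placeholders, not numbers from the paper, and the Wald interval is one reasonable uncertainty estimate among several.

```python
import math

def wald_ci(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wald interval for a Bernoulli success rate."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)

# Placeholder counts -- real values would come from matched sim and
# real-robot rollouts of the same task with the same policy checkpoint.
sim_success, sim_trials = 46, 50     # simulated evaluation
real_success, real_trials = 31, 50   # zero-shot real-robot evaluation

p_sim = sim_success / sim_trials
p_real = real_success / real_trials
gap = p_sim - p_real                 # sim-to-real transfer gap

for name, k, n in [("sim", sim_success, sim_trials),
                   ("real", real_success, real_trials)]:
    lo, hi = wald_ci(k, n)
    print(f"{name:4s} success rate {k / n:.2f}  (95% CI {lo:.2f}-{hi:.2f})")
print(f"transfer gap: {gap:.2f}")
```

A gap whose interval stays well above zero across tasks would undercut the substitution claim; a gap statistically indistinguishable from zero would support it.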
Original abstract
The development of robust and generalizable robot learning models is critically contingent upon the availability of large-scale, diverse training data and reliable evaluation benchmarks. Collecting data in the physical world poses prohibitive costs and scalability challenges, and prevailing simulation benchmarks frequently suffer from fragmentation, narrow scope, or insufficient fidelity to enable effective sim-to-real transfer. To address these challenges, we introduce Genie Sim 3.0, a unified simulation platform for robotic manipulation. We present Genie Sim Generator, a large language model (LLM)-powered tool that constructs high-fidelity scenes from natural language instructions. Its principal strength resides in rapid and multi-dimensional generalization, facilitating the synthesis of diverse environments to support scalable data collection and robust policy evaluation. We introduce the first benchmark that pioneers the application of LLM for automated evaluation. It leverages LLM to mass-generate evaluation scenarios and employs Vision-Language Model (VLM) to establish an automated assessment pipeline. We also release an open-source dataset comprising more than 10,000 hours of synthetic data across over 200 tasks. Through systematic experimentation, we validate the robust zero-shot sim-to-real transfer capability of our open-source dataset, demonstrating that synthetic data can server as an effective substitute for real-world data under controlled conditions for scalable policy training. For code and dataset details, please refer to: https://github.com/AgibotTech/genie_sim.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Genie Sim 3.0, a unified high-fidelity simulation platform for humanoid robot manipulation. It describes an LLM-powered Genie Sim Generator for rapid synthesis of diverse scenes from natural language instructions, the first benchmark that uses LLMs to mass-generate evaluation scenarios and VLMs for an automated assessment pipeline, and the release of an open-source dataset exceeding 10,000 hours of synthetic data across more than 200 tasks. The central claim is that systematic experimentation validates robust zero-shot sim-to-real transfer, demonstrating that the synthetic data can serve as an effective substitute for real-world data under controlled conditions for scalable policy training.
Significance. If the fidelity of generated scenes and the accuracy of the VLM pipeline are rigorously demonstrated with quantitative evidence, the platform and dataset could substantially advance scalable humanoid robot learning by lowering barriers to large-scale data collection and automated benchmarking, enabling broader experimentation in sim-to-real transfer.
major comments (2)
- [Abstract] Abstract: the claim that 'systematic experimentation' validates 'robust zero-shot sim-to-real transfer' and that 'synthetic data can serve as an effective substitute for real-world data' is unsupported because no quantitative metrics (success rates, transfer gaps, baselines), experimental protocols, or error analysis are supplied, rendering the central result unevaluable.
- [Benchmark section] Automated evaluation pipeline (described in the benchmark section): the VLM-based assessment is presented as establishing reliable policy scoring, yet no correlation analysis (e.g., Pearson r between VLM scores and real-robot success rates) or failure-mode study is reported; this is load-bearing for the transfer claim because VLMs are known to misjudge contact dynamics and grasp stability in manipulation tasks.
minor comments (1)
- [Abstract] Abstract: typographical error 'server' should read 'serve' in the phrase 'synthetic data can server as an effective substitute'.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We agree that the central claims require stronger quantitative support and will revise the manuscript to address both major comments. Our responses are provided point by point below.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'systematic experimentation' validates 'robust zero-shot sim-to-real transfer' and that 'synthetic data can serve as an effective substitute for real-world data' is unsupported because no quantitative metrics (success rates, transfer gaps, baselines), experimental protocols, or error analysis are supplied, rendering the central result unevaluable.
Authors: We acknowledge that the abstract states the claim without embedding specific numbers, protocols, or analysis, which makes the result difficult to evaluate from the abstract alone. The full manuscript contains experimental results in the evaluation section, but these details are not summarized quantitatively in the abstract. We will revise the abstract to include key metrics (e.g., real-robot success rates, sim-to-real transfer gaps relative to real-data baselines) and will add a concise description of the experimental protocol and error analysis. Corresponding expansions will appear in the experiments section.
Revision: yes
-
Referee: [Benchmark section] Automated evaluation pipeline (described in the benchmark section): the VLM-based assessment is presented as establishing reliable policy scoring, yet no correlation analysis (e.g., Pearson r between VLM scores and real-robot success rates) or failure-mode study is reported; this is load-bearing for the transfer claim because VLMs are known to misjudge contact dynamics and grasp stability in manipulation tasks.
Authors: We agree that the reliability of the VLM-based scoring pipeline must be demonstrated quantitatively, especially given known limitations of VLMs on contact-rich tasks. The current manuscript describes the pipeline but does not report correlation coefficients or a dedicated failure-mode study. We will add a new subsection in the benchmark section that presents Pearson correlation (and other agreement metrics) between VLM scores and both human annotations and real-robot success rates, along with a failure-mode analysis that explicitly examines misjudgments on grasp stability and contact dynamics. This revision will directly address the load-bearing concern for the transfer claims.
Revision: yes
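For concreteness, a minimal sketch of the agreement analysis promised above, using hypothetical paired scores; in practice each point would pair a VLM score with the measured real-robot success rate for the same policy and task.

```python
import numpy as np

# Hypothetical paired measurements for illustration only: one VLM score
# and one real-robot success rate per (policy, task) pair.
vlm_scores   = np.array([0.91, 0.74, 0.62, 0.88, 0.45, 0.70, 0.83, 0.57])
real_success = np.array([0.85, 0.70, 0.50, 0.80, 0.30, 0.75, 0.78, 0.48])

# Pearson r measures linear agreement between the two scorings.
r = np.corrcoef(vlm_scores, real_success)[0, 1]

# Residuals flag candidate failure modes: pairs where the VLM is
# systematically optimistic (e.g., misjudged grasp stability).
residuals = vlm_scores - real_success
worst = int(np.argmax(np.abs(residuals)))

print(f"Pearson r = {r:.3f}")
print(f"largest VLM-vs-real discrepancy at index {worst}: {residuals[worst]:+.2f}")
```

Inspecting the largest residuals is the cheap entry point to the failure-mode study the referee asks for.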
Circularity Check
No circularity in derivation chain; claims rest on empirical validation and released dataset
Full rationale
The paper presents a simulation platform, an LLM-powered scene generator, a VLM-based evaluation benchmark, and an open-source dataset of over 10,000 hours across more than 200 tasks. No equations, fitted parameters, or first-principles derivations appear in the provided text. The central claim of zero-shot sim-to-real transfer is supported by systematic experimentation and the released dataset rather than any self-referential construction, self-citation chain, or renaming of inputs as outputs. The validation therefore rests on external benchmarks and physical experiments, with no load-bearing step that reduces to the paper's own inputs by definition.
Forward citations
Cited by 2 Pith papers
-
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy
JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
-
Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama
A feed-forward Gaussian-splatting system reconstructs photo-realistic 3D scenes from single-view panoramas in seconds via cube-map decomposition and depth-aware fusion for robotic simulation use.
Reference graph
Works this paper leans on
-
[1]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”. In: Proceedings of Robotics: Science and Systems. Daegu, Republic of Korea, July 2023. DOI: 10.15607/RSS.2023.XIX.026
-
[2]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim et al. “OpenVLA: An Open-Source Vision-Language-Action Model”. In: arXiv preprint arXiv:2406.09246 (2024)
-
[3]
Is Diversity All You Need for Scalable Robotic Manipulation?
Modi Shi et al. “Is Diversity All You Need for Scalable Robotic Manipulation?” In: arXiv preprint arXiv:2507.06219 (2025)
-
[4]
DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning
Zhenyu Jiang et al. “DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning”. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). 2025
-
[5]
Object-Centric Dexterous Manipulation from Human Motion Data
Yuanpei Chen et al. “Object-Centric Dexterous Manipulation from Human Motion Data”. In: 8th Annual Conference on Robot Learning. 2024
- [6]
-
[7]
Sim2Real Predictivity: Does Evaluation in Simulation Predict Real-World Performance?
Abhishek Kadian et al. “Sim2Real Predictivity: Does Evaluation in Simulation Predict Real-World Performance?” In: IEEE Robotics and Automation Letters 5.4 (2020), pp. 6670–6677. DOI: 10.1109/LRA.2020.3013848
-
[8]
Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models
Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. “Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models”. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). 2024, pp. 6672–6679. DOI: 10.1109/ICRA57147.2024.10610566
- [9]
-
[10]
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
Shengliang Deng et al. “GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data”. In: (2025). arXiv: 2505.03233 [cs.RO]. URL: https://arxiv.org/abs/2505.03233
- [11]
-
[12]
Learning high-fidelity robot self-model with articulated 3D Gaussian splatting
Kejun Hu, Peng Yu, and Ning Tan. “Learning high-fidelity robot self-model with articulated 3D Gaussian splatting”. In: The International Journal of Robotics Research 0.0 (2025). DOI: 10.1177/02783649251396980. URL: https://doi.org/10.1177/02783649251396980
-
[13]
Evaluating Real-World Robot Manipulation Policies in Simulation
Xuanlin Li et al. Evaluating Real-World Robot Manipulation Policies in Simulation. 2024. arXiv: 2405.05941 [cs.RO]. URL: https://arxiv.org/abs/2405.05941
-
[14]
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
Yanjiang Guo et al. “Ctrl-World: A Controllable Generative World Model for Robot Manipulation”. In: arXiv preprint arXiv:2510.10125 (2025)
- [15]
-
[16]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. 2025. arXiv: 2403.12945 [cs.RO]. URL: https://arxiv.org/abs/2403.12945
- [17]
-
[18]
RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot
Hao-Shu Fang et al. “RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot”. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2024, pp. 653–660
-
[19]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Embodiment Collaboration et al. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. 2025. arXiv: 2310.08864 [cs.RO]. URL: https://arxiv.org/abs/2310.08864
-
[20]
AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems
Qingwen Bu et al. “AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems”. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE. 2025
-
[21]
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
Soroush Nasiriany et al. “RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots”. In: Robotics: Science and Systems. 2024
-
[22]
DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes
Jialiang Zhang et al. “DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes”. In: 8th Annual Conference on Robot Learning. 2024
-
[23]
RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation
Tianxing Chen et al. “RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation”. In: arXiv preprint arXiv:2506.18088 (2025)
-
[24]
Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
Tianhe Yu et al. “Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning”. In: Conference on Robot Learning (CoRL). 2019. arXiv: 1910.10897 [cs.LG]. URL: https://arxiv.org/abs/1910.10897
-
[25]
HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning
Zhi Jing et al. “HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning”. In: arXiv preprint arXiv:2507.00833 (2025)
-
[26]
HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation
Carmelo Sferrazza et al. HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation. 2024
-
[27]
BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark
Nikita Chernyadev et al. “BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark”. In: arXiv preprint arXiv:2407.07788 (2024)
-
[28]
BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation
Chengshu Li et al. “BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation”. In: arXiv preprint arXiv:2403.09227 (2024)
-
[29]
ManipulaTHOR: A Framework for Visual Object Manipulation
Kiana Ehsani et al. “ManipulaTHOR: A Framework for Visual Object Manipulation”. In: CVPR. 2021
- [30]
-
[31]
DaXBench: Benchmarking Deformable Object Manipulation with Differentiable Physics
Siwei Chen et al. “DaXBench: Benchmarking Deformable Object Manipulation with Differentiable Physics”. In: ICLR. 2023
-
[32]
SoftGym: Benchmarking Deep Reinforcement Learning for Deformable Object Manipulation
Xingyu Lin et al. “SoftGym: Benchmarking Deep Reinforcement Learning for Deformable Object Manipulation”. In: Conference on Robot Learning. 2020
- [33]
- [34]
- [35]
-
[36]
Exploring the Limits of Vision-Language-Action Manipulation in Cross-task Generalization
Jiaming Zhou et al. “Exploring the Limits of Vision-Language-Action Manipulation in Cross-task Generalization”. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025. URL: https://openreview.net/forum?id=h6xQClTm4W
- [37]
-
[38]
3D Gaussian splatting for real-time radiance field rendering
Bernhard Kerbl et al. “3D Gaussian splatting for real-time radiance field rendering”. In: ACM Trans. Graph. 42.4 (2023), Art. 139
-
[39]
SuperPoint: Self-supervised interest point detection and description
Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. “SuperPoint: Self-supervised interest point detection and description”. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2018, pp. 224–236
-
[40]
LightGlue: Local feature matching at light speed
Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. “LightGlue: Local feature matching at light speed”. In: Proceedings of the IEEE/CVF international conference on computer vision. 2023, pp. 17627–17638
-
[41]
Domain-size pooling in local descriptors: DSP-SIFT
Jingming Dong and Stefano Soatto. “Domain-size pooling in local descriptors: DSP-SIFT”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 5097–5106
-
[42]
Colmap-PCD: An open-source tool for fine image-to-point cloud registration
Chunge Bai, Ruijie Fu, and Xiang Gao. “Colmap-PCD: An open-source tool for fine image-to-point cloud registration”. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2024, pp. 1723–1729
-
[43]
gsplat: An open-source library for Gaussian splatting
Vickie Ye et al. “gsplat: An open-source library for Gaussian splatting”. In: Journal of Machine Learning Research 26.34 (2025), pp. 1–17
-
[44]
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
Jay Zhangjie Wu et al. “DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models”. In: Proceedings of the Computer Vision and Pattern Recognition Conference. 2025, pp. 26024–26035
-
[45]
PGSR: Planar-based Gaussian Splatting for Efficient and High-Fidelity Surface Reconstruction
Danpeng Chen et al. “PGSR: Planar-based Gaussian Splatting for Efficient and High-Fidelity Surface Reconstruction”. In: arXiv preprint arXiv:2406.06521 (2024)
-
[46]
cuRobo: Parallelized collision-free minimum-jerk robot motion generation
Balakumar Sundaralingam et al. “cuRobo: Parallelized collision-free minimum-jerk robot motion generation”. In: arXiv preprint arXiv:2310.17274 (2023)
-
[47]
GraspNet: A Large-Scale Clustered and Densely Annotated Dataset for Object Grasping
Haoshu Fang et al. “GraspNet: A Large-Scale Clustered and Densely Annotated Dataset for Object Grasping”. In: CoRR abs/1912.13470 (2019). arXiv: 1912.13470. URL: http://arxiv.org/abs/1912.13470
-
[48]
π0.5: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence et al. π0.5: a Vision-Language-Action Model with Open-World Generalization. 2025. arXiv: 2504.16054 [cs.LG]. URL: https://arxiv.org/abs/2504.16054
-
[49]
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Qingwen Bu et al. “UniVLA: Learning to Act Anywhere with Task-centric Latent Actions”. In: arXiv preprint arXiv:2505.06111 (2025)
-
[50]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu et al. “RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation”. In: arXiv preprint arXiv:2410.07864 (2024)
-
[51]
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Jinliang Zheng et al. “X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model”. In: arXiv preprint arXiv:2510.10274 (2025)