VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes

Angjoo Kanazawa; Carmelo Sferrazza; Guanya Shi; Jiaman Li; Karen Liu; Koushil Sreenath; Pei Xu; Pieter Abbeel; Rocky Duan; Sirui Chen

arxiv: 2606.30645 · v1 · pith:6WUXVKTCnew · submitted 2026-06-29 · 💻 cs.RO · cs.AI· cs.GR· cs.SY· eess.SY

VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes

Yen-Jen Wang , Jiaman Li , Sirui Chen , Takara E. Truong , Pei Xu , Pieter Abbeel , Rocky Duan , Koushil Sreenath

show 4 more authors

Angjoo Kanazawa Carmelo Sferrazza Guanya Shi Karen Liu

This is my paper

Pith reviewed 2026-06-30 04:47 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.GRcs.SYeess.SY

keywords humanoid roboticsloco-manipulationsynthetic data3D scene reconstructionvision-language-kinematicssim-to-real transferwhole-body control

0 comments

The pith

Synthesizing vision-language-kinematics trajectories in reconstructed scenes trains policies that enable physical humanoid loco-manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies the absence of large-scale datasets containing synchronized egocentric images, language commands, and whole-body kinematic trajectories as the main barrier to perception-based humanoid loco-manipulation. It solves this by reconstructing metric indoor environments with 3D Gaussian Splatting, generating navigation and object-interaction trajectories using privileged scene geometry, and rendering the corresponding egocentric views to create 48,000 complete VLK tuples without human effort. A policy trained on these tuples predicts short-horizon whole-body trajectories that a tracker converts into robot actions, and the resulting system performs navigation and single-object transport on a physical Unitree G1. A sympathetic reader would care because this pipeline removes the need for expensive real-world data collection while still producing transferable perception-based control.

Core claim

Reconstructing scenes with 3D Gaussian Splatting, synthesizing trajectories with privileged scene information, and rendering paired egocentric observations produces 48,000 vision-language-kinematics tuples that suffice to train a policy predicting short-horizon whole-body kinematic trajectories; when executed by a whole-body tracker on the physical Unitree G1, the policy completes navigation and single-object transport tasks.

What carries the argument

The VLK data-generation pipeline that uses 3D Gaussian Splatting for metric scene reconstruction, privileged-information trajectory synthesis for navigation and object interaction, and post-hoc egocentric rendering to create paired supervision for whole-body policy learning.

If this is right

Training data for perception-based loco-manipulation can be scaled without manual collection or real-robot trials.
A single policy can handle both navigation and object transport through short-horizon whole-body kinematic predictions.
3D Gaussian Splatting reconstructions supply metric-scale geometry sufficient for trajectory synthesis that transfers to hardware.
Language commands can be directly paired with robot-compatible motions via the rendered observation-trajectory tuples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reconstruction-plus-synthesis approach could support multi-step or multi-object tasks if the underlying scene model is extended to handle dynamics.
Reducing dependence on real data collection may allow faster iteration on humanoid controllers for varied indoor environments.
Similar synthetic pipelines might be tested on other robot platforms to check whether the transfer property holds beyond the Unitree G1.

Load-bearing premise

Trajectories synthesized with privileged scene information and rendered egocentric views will produce policies that transfer to physical hardware without large domain gaps or additional real data.

What would settle it

Deployment of the trained policy on the physical Unitree G1 yields near-zero success rate on the navigation and single-object transport tasks, or quantitative metrics show a large performance gap relative to privileged-information baselines.

Figures

Figures reproduced from arXiv: 2606.30645 by Angjoo Kanazawa, Carmelo Sferrazza, Guanya Shi, Jiaman Li, Karen Liu, Koushil Sreenath, Pei Xu, Pieter Abbeel, Rocky Duan, Sirui Chen, Takara E. Truong, Yen-Jen Wang.

**Figure 1.** Figure 1: Synthetic interactions in reconstructed scenes enable real-world perception-based locomanipulation. Synthetic data generation produces paired egocentric observations, task instructions, and G1 kinematic trajectories for VLK training. Top: A–C show atomic prompt-conditioned skills generated in reconstructed scenes (left) and executed on the physical G1 in the real world (right). Bottom: long-horizon real-… view at source ↗

**Figure 2.** Figure 2: Method overview. Top: we reconstruct 3D scenes, generate task waypoints, synthesize G1 motions, and render egocentric observations to produce paired vision-language-kinematics supervision. Bottom: the paired data train a VLK policy to predict short-horizon G1 kinematic trajectories from egocentric observations, task instructions, and the current G1 state; at deployment, a whole-body tracker converts the … view at source ↗

**Figure 3.** Figure 3: Examples of generated VLK supervision. Each sequence pairs an egocentric RGB observation and language instruction with the corresponding G1 whole-body kinematic trajectory. ally conditions on the initial object state, object geometry features [38], desired relative wrist poses, and wrist contact labels. These conditions specify interaction constraints such as wrist placement, and contact timing. Additional… view at source ↗

**Figure 4.** Figure 4: Ablation of training-data volume in the lab scene. We vary the number of synthesized trajectories per mode per layout and report success rates on the lab-scene validation set. How does the amount of synthesized training data affect performance across different tasks? For each layout in the lab scene, our default setting generates 1000 trajectories per mode. To study the effect of synthesized training-da… view at source ↗

**Figure 5.** Figure 5: Real-world deployment in lab and apartment scenes. The VLK system executes language-conditioned navigation and box manipulation under different layouts and visual variations. Top: sequential lab-scene executions for manipulation and navigation. Bottom: apartment-scene deployments under lighting (disco light) and layout variations. modes. Object-directed navigation achieves high success rates even with lim… view at source ↗

**Figure 6.** Figure 6: Examples from the synthetic vision-language-kinematics dataset. Each example pairs an [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Deployment system overview. Three concurrent on-robot processes (blue: state estimation and whole-body tracking; orange: VLK inference client) communicate through shared memory and inter-process queues. The client offloads flow-matching sampling to a remote VLK server (green) over a websocket. The whole-body tracker runs at 50 Hz and emits joint-level PD targets, while the inference loop replans at ∼1.8 … view at source ↗

read the original abstract

Perception-based humanoid loco-manipulation requires connecting egocentric observations and task instructions to whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, yet no existing data source provides this complete tuple at scale. We address this bottleneck by generating vision-language-kinematics (VLK) supervision synthetically in reconstructed scenes. Our pipeline leverages 3D Gaussian Splatting to reconstruct metric-scale indoor environments, synthesizes navigation and object-interaction trajectories using privileged scene information, and renders paired egocentric observations after the fact. We produce 48,000 paired trajectories with no human intervention and train a VLK policy that predicts short-horizon whole-body kinematic trajectories. A whole-body tracker converts these predictions into actions on the physical humanoid. We evaluate on the physical Unitree G1 performing navigation and single-object transport, demonstrating that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation. Project Website: https://vision-language-kinematics.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's new element is a concrete pipeline that uses 3DGS reconstruction plus privileged trajectory synthesis to generate 48k VLK pairs for humanoid loco-manipulation, but the hardware transfer claim has no supporting numbers.

read the letter

The paper shows a way to generate large amounts of vision-language-kinematics data without manual collection. It reconstructs indoor scenes at metric scale with 3D Gaussian Splatting, creates navigation and object-interaction trajectories using full scene knowledge, then renders the matching egocentric views afterward. This produces 48,000 paired examples that train a policy to output short-horizon kinematic trajectories from images and language commands, which a whole-body tracker then executes on the Unitree G1.

The pipeline is straightforward and addresses a real bottleneck in perception-driven humanoid work. Generating the data automatically and at that volume is the practical step forward, and the choice to render views after trajectory synthesis keeps the privileged information separate from the final observations.

The soft spot is the evaluation. The abstract states that the approach works for navigation and single-object transport on the physical robot, yet it supplies no success rates, no baseline comparisons, no ablations on reconstruction quality, and no metrics on how close the rendered images are to real camera feeds. Without those numbers the transfer claim stays unverified.

The work is aimed at groups building whole-body humanoid controllers that rely on egocentric vision and language. Readers already working on synthetic data or sim-to-real gaps for loco-manipulation will find the specific combination of reconstruction, synthesis, and rendering worth looking at.

It deserves peer review because the data-generation idea is concrete and the scale is notable, even if the current evidence for hardware performance needs to be filled in.

Referee Report

1 major / 2 minor

Summary. The paper introduces a synthetic data generation pipeline for vision-language-kinematics (VLK) supervision to train policies for perception-based humanoid loco-manipulation. It reconstructs indoor scenes using 3D Gaussian Splatting, synthesizes navigation and interaction trajectories with privileged scene access, renders egocentric images, generates 48,000 trajectories, trains a policy predicting short-horizon whole-body kinematics, and transfers to the physical Unitree G1 robot using a whole-body tracker, claiming successful demonstration on navigation and single-object transport tasks.

Significance. Should the transfer results be quantitatively validated, this work would offer a valuable contribution by providing a scalable approach to generating large-scale, synchronized VLK data without human effort. This could significantly reduce the data collection barrier for training end-to-end policies that map egocentric vision and language to whole-body actions in humanoid robots. The use of 3DGS for metric-scale reconstruction and post-hoc rendering is a promising direction for bridging simulation and reality in loco-manipulation.

major comments (1)

[Abstract] Abstract: The central claim that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation is not supported by quantitative evidence. The abstract asserts evaluation and demonstration of effectiveness on the physical Unitree G1 for navigation and single-object transport, yet supplies no success rates, baseline comparisons, error bars, ablation studies on reconstruction fidelity or rendering quality, or transfer metrics. This absence is load-bearing for the supervision claim, as the policy transfer assertion cannot be assessed without these results.

minor comments (2)

The abstract introduces the VLK policy and 'short-horizon whole-body kinematic trajectories' without specifying the prediction horizon, output representation (e.g., joint angles vs. velocities), or conditioning details, which would improve clarity of the learned mapping.
A schematic diagram of the full pipeline (reconstruction → trajectory synthesis → rendering → policy training → tracker) would aid reader understanding of the data flow and privileged information usage.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for quantitative support. We address the major comment below and will revise the manuscript to strengthen the presentation of results.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation is not supported by quantitative evidence. The abstract asserts evaluation and demonstration of effectiveness on the physical Unitree G1 for navigation and single-object transport, yet supplies no success rates, baseline comparisons, error bars, ablation studies on reconstruction fidelity or rendering quality, or transfer metrics. This absence is load-bearing for the supervision claim, as the policy transfer assertion cannot be assessed without these results.

Authors: We agree that quantitative metrics are necessary to substantiate the central claim of effective supervision and sim-to-real transfer. The manuscript currently presents the pipeline and qualitative physical demonstrations; in revision we will add success rates for navigation and single-object transport on the Unitree G1, baseline comparisons, error bars, and any available ablations on reconstruction fidelity or rendering quality to the results section and abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical data-generation pipeline with no derivation chain

full rationale

The paper presents an empirical pipeline that reconstructs scenes with 3D Gaussian Splatting, synthesizes trajectories using privileged information, renders egocentric views, and trains a policy on the resulting VLK tuples. No equations, fitted parameters renamed as predictions, self-citational uniqueness theorems, or ansatzes are described in the abstract or reader summary. The central claim is that the generated synthetic supervision enables sim-to-real transfer on hardware, which is an empirical assertion rather than a mathematical derivation that could reduce to its own inputs by construction. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method implicitly relies on standard assumptions from 3D reconstruction and robotics simulation literature that are not detailed here.

pith-pipeline@v0.9.1-grok · 5770 in / 1192 out tokens · 37584 ms · 2026-06-30T04:47:45.822079+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 1 canonical work pages

[1]

Y . Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025. URLhttps://arxiv.org/abs/2511.02832

arXiv 2025
[2]

Y . Li, Y . Lin, J. Cui, T. Liu, W. Liang, Y . Zhu, and S. Huang. Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks.arXiv preprint arXiv:2506.08931, 2025. URL https://arxiv.org/abs/2506.08931

arXiv 2025
[3]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, G. Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

2023
[5]

S. Chen, S. Zhao, Z. Wu, J. Li, G. Shi, and C. K. Liu. Scenebot: Contact-prompted general humanoid whole body tracking with scene-interaction, 2026. URLhttps://arxiv.org/ abs/2606.27581

Pith/arXiv arXiv 2026
[6]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. San- keti, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T.-W. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Jul...

2023
[7]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. InProceedings of The 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learni...

2025
[8]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, 9 K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control. InRobotics: Science and...

2025
[9]

Bjorck, F

NVIDIA, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, et al. GR00T N1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[10]

H. Xue, X. Huang, D. Niu, Q. Liao, T. Kragerud, J. T. Gravdahl, X. B. Peng, G. Shi, T. Dar- rell, K. Sreenath, and S. Sastry. LeVERB: Humanoid whole-body control with latent vision- language instruction.arXiv preprint arXiv:2506.13751, 2025

arXiv 2025
[11]

Jiang, J

H. Jiang, J. Chen, Q. Bu, L. Chen, M. Shi, Y . Zhang, D. Li, C. Suo, C. Wang, Z. Peng, and H. Li. WholeBodyVLA: Towards unified latent VLA for whole-body loco-manipulation control. InInternational Conference on Learning Representations, 2026

2026
[12]

P. Ding, J. Ma, X. Tong, B. Zou, X. Luo, Y . Fan, T. Wang, H. Lu, P. Mo, J. Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795, 2025

arXiv 2025
[13]

S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, S. Zang, W. Yuan, M. Pavone, D. Huang, and Y . Wang.Ψ 0: An open foundation model towards universal hu- manoid loco-manipulation. InRobotics: Science and Systems, 2026

2026
[14]

S. Chen, Y . Ye, Z.-A. Cao, J. Lew, P. Xu, and C. K. Liu. Hand-eye autonomous delivery: Learning humanoid navigation, locomotion and reaching.arXiv preprint arXiv:2508.03068, 2025

arXiv 2025
[15]

Y . Shao, X. Huang, B. Zhang, Q. Liao, Y . Gao, Y . Chi, Z. Li, S. Shao, and K. Sreenath. Lang- WBC: Language-directed humanoid whole-body control via end-to-end learning. InRobotics: Science and Systems, 2025

2025
[16]

P. Li, Z. Zhuang, Y . Gao, Y . Dong, S. Li, C. Jiang, S. Dou, Z. Xi, E. Zhou, J. Huang, H. Li, J. Gong, X. Ma, T. Gui, Z. Wu, Q. Zhang, X. Huang, Y .-G. Jiang, and X. Qiu. FRoM-W1: Towards general humanoid whole-body control with language instructions.arXiv preprint arXiv:2601.12799, 2026

arXiv 2026
[18]

Y . Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y . Zou, M. Xu, L. Lin, Z. Xie, M. Ding, and P. Luo. RoboTwin: Dual-arm robot benchmark with generative digital twins. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2025

2025
[19]

M. N. Qureshi, S. Garg, F. Yandun, D. Held, G. Kantor, and A. Silwal. SplatSim: Zero- shot sim2real transfer of RGB manipulation policies using gaussian splatting.arXiv preprint arXiv:2409.10161, 2024

arXiv 2024
[20]

X. Li, J. Li, Z. Zhang, R. Zhang, F. Jia, T. Wang, H. Fan, K.-K. Tseng, and R. Wang. RoboGSim: A real2sim2real robotic gaussian splatting simulator.arXiv preprint arXiv:2411.11839, 2024

arXiv 2024
[21]

Y . Wu, L. Pan, W. Wu, G. Wang, Y . Miao, and H. Wang. RL-GSBridge: 3d gaussian splatting based real2sim2real method for robotic manipulation learning. InIEEE International Confer- ence on Robotics and Automation, 2025. 10

2025
[22]

Escontrela, J

A. Escontrela, J. Kerr, A. Allshire, J. Frey, R. Duan, C. Sferrazza, and P. Abbeel. GaussGym: An open-source real-to-sim framework for learning locomotion from pixels. InInternational Conference on Learning Representations, 2026

2026
[23]

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. OmniRetarget: Interaction-preserving data generation for humanoid whole-body loco- manipulation and scene interaction. InIEEE International Conference on Robotics and Au- tomation, 2026

2026
[24]

J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. InIEEE International Conference on Robotics and Automation, 2026

2026
[25]

T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. Kitani, C. Liu, and G. Shi. Om- niH2O: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. InProceedings of The 8th Conference on Robot Learning, Proceedings of Machine Learning Research. PMLR, 2024

2024
[26]

Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. GMT: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

arXiv 2025
[27]

Q. Liao, T. E. Truong, X. Huang, G. Tevet, K. Sreenath, and C. K. Liu. BeyondMimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

Pith/arXiv arXiv 2025
[28]

S. Zhao, Y . Ze, Y . Wang, C. K. Liu, P. Abbeel, G. Shi, and R. Duan. Resmimic: From gen- eral motion tracking to humanoid whole-body loco-manipulation via residual learning.arXiv preprint arXiv:2510.05070, 2025

arXiv 2025
[29]

Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Casta ˜neda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, X. Da, R. Ding, C. Hogg, L. Song, E. Lim, E. Jeong, T. He, H. Xue, W. Xiao, Z. Wang, S. Yuen, J. Kautz, Y . Chang, U. Iqbal, L. Fan, and Y . Zhu. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025
[30]

Kalaria, S

D. Kalaria, S. S. Harithas, P. Katara, S. Kwak, S. Bhagat, S. Sastry, S. Sridhar, S. Vemprala, A. Kapoor, and J. C.-K. Huang. DreamControl: Human-inspired whole-body humanoid con- trol for scene interaction via guided diffusion.arXiv preprint arXiv:2509.14353, 2025

arXiv 2025
[31]

Zhang, K

Z. Zhang, K. Wen, M. Xu, J. He, C. Li, T. Miki, C. Schwarke, C. Zhang, X. B. Peng, and M. Hutter. Learning whole-body humanoid locomotion via motion generation and motion tracking.arXiv preprint arXiv:2604.17335, 2026

Pith/arXiv arXiv 2026
[32]

Rempe, M

D. Rempe, M. Zanfir, Y .-T. Hu, R. Li, C. Li, J. Tseng, L. Liu, T. Darrell, A. Kanazawa, J. Kautz, C. K. Liu, S. Fidler, L. Guibas, S. Saito, O. Litany, G. Shafir, T.-Y . Lin, S. Tulsiani, A. A. Efros, et al. Kimodo: Scaling controllable human motion generation.arXiv preprint arXiv:2603.15546, 2026. URLhttps://arxiv.org/abs/2603.15546

arXiv 2026
[33]

J. Li, A. Clegg, R. Mottaghi, J. Wu, X. Puig, and C. K. Liu. Controllable human-object interaction synthesis. InEuropean Conference on Computer Vision, 2024

2024
[34]

Z. Wu, J. Li, P. Xu, and C. K. Liu. Human-object interaction from human-level instructions. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2025
[35]

BONES-SEED: Skeletal everyday embodiment dataset.https:// huggingface.co/datasets/bones-studio/seed, 2025

Bones Studio. BONES-SEED: Skeletal everyday embodiment dataset.https:// huggingface.co/datasets/bones-studio/seed, 2025. Hugging Face dataset. Accessed: 2026-05-22

2025
[36]

J. Li, J. Wu, and C. K. Liu. Object motion guided human motion synthesis.ACM Transactions on Graphics, 42(6):225:1–225:11, 2023. doi:10.1145/3618333. 11

work page doi:10.1145/3618333 2023
[37]

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5745–5753, 2019

2019
[38]

Prokudin, C

S. Prokudin, C. Lassner, and J. Romero. Efficient learning on point clouds with basis point sets. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4332–4341, 2019

2019
[39]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[40]

S. Chen, Z. ang Cao, Z. Luo, F. Casta ˜neda, C. Li, T. Wang, Y . Yuan, L. J. Fan, C. K. Liu, and Y . Zhu. Chip: Adaptive compliance for humanoid control through hindsight perturbation,
[41]

URLhttps://arxiv.org/abs/2512.14689

arXiv
[42]

Y . Ma, Y . Zhou, Y . Yang, T. Wang, and H. Fan. Running vlas at real-time speed.arXiv preprint arXiv:2510.26742, 2025

arXiv 2025
[43]

Z. Wu, J. Li, P. Xu, and C. K. Liu. Human-object interaction from human-level instructions. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11176– 11186, 2025

2025
[44]

B. Yi, V . Ye, M. Zheng, Y . Li, L. M ¨uller, G. Pavlakos, Y . Ma, J. Malik, and A. Kanazawa. Estimating body and hand motion in an ego-sensed world. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[45]

walk toward the chair

M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Mu ˜noz, X. Yao, R. Z ¨urbr¨ugg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y . Feng, A. Garg, R. Gasoto, L. Gulich, Y . Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V . Makoviychuk, G. Malczyk, ...

Pith/arXiv arXiv 2025

[1] [1]

Y . Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025. URLhttps://arxiv.org/abs/2511.02832

arXiv 2025

[2] [2]

Y . Li, Y . Lin, J. Cui, T. Liu, W. Liang, Y . Zhu, and S. Huang. Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks.arXiv preprint arXiv:2506.08931, 2025. URL https://arxiv.org/abs/2506.08931

arXiv 2025

[3] [3]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, G. Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

2023

[4] [5]

S. Chen, S. Zhao, Z. Wu, J. Li, G. Shi, and C. K. Liu. Scenebot: Contact-prompted general humanoid whole body tracking with scene-interaction, 2026. URLhttps://arxiv.org/ abs/2606.27581

Pith/arXiv arXiv 2026

[5] [6]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. San- keti, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T.-W. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Jul...

2023

[6] [7]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. InProceedings of The 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learni...

2025

[7] [8]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, 9 K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control. InRobotics: Science and...

2025

[8] [9]

Bjorck, F

NVIDIA, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, et al. GR00T N1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[9] [10]

H. Xue, X. Huang, D. Niu, Q. Liao, T. Kragerud, J. T. Gravdahl, X. B. Peng, G. Shi, T. Dar- rell, K. Sreenath, and S. Sastry. LeVERB: Humanoid whole-body control with latent vision- language instruction.arXiv preprint arXiv:2506.13751, 2025

arXiv 2025

[10] [11]

Jiang, J

H. Jiang, J. Chen, Q. Bu, L. Chen, M. Shi, Y . Zhang, D. Li, C. Suo, C. Wang, Z. Peng, and H. Li. WholeBodyVLA: Towards unified latent VLA for whole-body loco-manipulation control. InInternational Conference on Learning Representations, 2026

2026

[11] [12]

P. Ding, J. Ma, X. Tong, B. Zou, X. Luo, Y . Fan, T. Wang, H. Lu, P. Mo, J. Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795, 2025

arXiv 2025

[12] [13]

S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, S. Zang, W. Yuan, M. Pavone, D. Huang, and Y . Wang.Ψ 0: An open foundation model towards universal hu- manoid loco-manipulation. InRobotics: Science and Systems, 2026

2026

[13] [14]

S. Chen, Y . Ye, Z.-A. Cao, J. Lew, P. Xu, and C. K. Liu. Hand-eye autonomous delivery: Learning humanoid navigation, locomotion and reaching.arXiv preprint arXiv:2508.03068, 2025

arXiv 2025

[14] [15]

Y . Shao, X. Huang, B. Zhang, Q. Liao, Y . Gao, Y . Chi, Z. Li, S. Shao, and K. Sreenath. Lang- WBC: Language-directed humanoid whole-body control via end-to-end learning. InRobotics: Science and Systems, 2025

2025

[15] [16]

P. Li, Z. Zhuang, Y . Gao, Y . Dong, S. Li, C. Jiang, S. Dou, Z. Xi, E. Zhou, J. Huang, H. Li, J. Gong, X. Ma, T. Gui, Z. Wu, Q. Zhang, X. Huang, Y .-G. Jiang, and X. Qiu. FRoM-W1: Towards general humanoid whole-body control with language instructions.arXiv preprint arXiv:2601.12799, 2026

arXiv 2026

[16] [18]

Y . Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y . Zou, M. Xu, L. Lin, Z. Xie, M. Ding, and P. Luo. RoboTwin: Dual-arm robot benchmark with generative digital twins. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2025

2025

[17] [19]

M. N. Qureshi, S. Garg, F. Yandun, D. Held, G. Kantor, and A. Silwal. SplatSim: Zero- shot sim2real transfer of RGB manipulation policies using gaussian splatting.arXiv preprint arXiv:2409.10161, 2024

arXiv 2024

[18] [20]

X. Li, J. Li, Z. Zhang, R. Zhang, F. Jia, T. Wang, H. Fan, K.-K. Tseng, and R. Wang. RoboGSim: A real2sim2real robotic gaussian splatting simulator.arXiv preprint arXiv:2411.11839, 2024

arXiv 2024

[19] [21]

Y . Wu, L. Pan, W. Wu, G. Wang, Y . Miao, and H. Wang. RL-GSBridge: 3d gaussian splatting based real2sim2real method for robotic manipulation learning. InIEEE International Confer- ence on Robotics and Automation, 2025. 10

2025

[20] [22]

Escontrela, J

A. Escontrela, J. Kerr, A. Allshire, J. Frey, R. Duan, C. Sferrazza, and P. Abbeel. GaussGym: An open-source real-to-sim framework for learning locomotion from pixels. InInternational Conference on Learning Representations, 2026

2026

[21] [23]

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. OmniRetarget: Interaction-preserving data generation for humanoid whole-body loco- manipulation and scene interaction. InIEEE International Conference on Robotics and Au- tomation, 2026

2026

[22] [24]

J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. InIEEE International Conference on Robotics and Automation, 2026

2026

[23] [25]

T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. Kitani, C. Liu, and G. Shi. Om- niH2O: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. InProceedings of The 8th Conference on Robot Learning, Proceedings of Machine Learning Research. PMLR, 2024

2024

[24] [26]

Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. GMT: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

arXiv 2025

[25] [27]

Q. Liao, T. E. Truong, X. Huang, G. Tevet, K. Sreenath, and C. K. Liu. BeyondMimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

Pith/arXiv arXiv 2025

[26] [28]

S. Zhao, Y . Ze, Y . Wang, C. K. Liu, P. Abbeel, G. Shi, and R. Duan. Resmimic: From gen- eral motion tracking to humanoid whole-body loco-manipulation via residual learning.arXiv preprint arXiv:2510.05070, 2025

arXiv 2025

[27] [29]

Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Casta ˜neda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, X. Da, R. Ding, C. Hogg, L. Song, E. Lim, E. Jeong, T. He, H. Xue, W. Xiao, Z. Wang, S. Yuen, J. Kautz, Y . Chang, U. Iqbal, L. Fan, and Y . Zhu. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025

[28] [30]

Kalaria, S

D. Kalaria, S. S. Harithas, P. Katara, S. Kwak, S. Bhagat, S. Sastry, S. Sridhar, S. Vemprala, A. Kapoor, and J. C.-K. Huang. DreamControl: Human-inspired whole-body humanoid con- trol for scene interaction via guided diffusion.arXiv preprint arXiv:2509.14353, 2025

arXiv 2025

[29] [31]

Zhang, K

Z. Zhang, K. Wen, M. Xu, J. He, C. Li, T. Miki, C. Schwarke, C. Zhang, X. B. Peng, and M. Hutter. Learning whole-body humanoid locomotion via motion generation and motion tracking.arXiv preprint arXiv:2604.17335, 2026

Pith/arXiv arXiv 2026

[30] [32]

Rempe, M

D. Rempe, M. Zanfir, Y .-T. Hu, R. Li, C. Li, J. Tseng, L. Liu, T. Darrell, A. Kanazawa, J. Kautz, C. K. Liu, S. Fidler, L. Guibas, S. Saito, O. Litany, G. Shafir, T.-Y . Lin, S. Tulsiani, A. A. Efros, et al. Kimodo: Scaling controllable human motion generation.arXiv preprint arXiv:2603.15546, 2026. URLhttps://arxiv.org/abs/2603.15546

arXiv 2026

[31] [33]

J. Li, A. Clegg, R. Mottaghi, J. Wu, X. Puig, and C. K. Liu. Controllable human-object interaction synthesis. InEuropean Conference on Computer Vision, 2024

2024

[32] [34]

Z. Wu, J. Li, P. Xu, and C. K. Liu. Human-object interaction from human-level instructions. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2025

[33] [35]

BONES-SEED: Skeletal everyday embodiment dataset.https:// huggingface.co/datasets/bones-studio/seed, 2025

Bones Studio. BONES-SEED: Skeletal everyday embodiment dataset.https:// huggingface.co/datasets/bones-studio/seed, 2025. Hugging Face dataset. Accessed: 2026-05-22

2025

[34] [36]

J. Li, J. Wu, and C. K. Liu. Object motion guided human motion synthesis.ACM Transactions on Graphics, 42(6):225:1–225:11, 2023. doi:10.1145/3618333. 11

work page doi:10.1145/3618333 2023

[35] [37]

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5745–5753, 2019

2019

[36] [38]

Prokudin, C

S. Prokudin, C. Lassner, and J. Romero. Efficient learning on point clouds with basis point sets. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4332–4341, 2019

2019

[37] [39]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[38] [40]

S. Chen, Z. ang Cao, Z. Luo, F. Casta ˜neda, C. Li, T. Wang, Y . Yuan, L. J. Fan, C. K. Liu, and Y . Zhu. Chip: Adaptive compliance for humanoid control through hindsight perturbation,

[39] [41]

URLhttps://arxiv.org/abs/2512.14689

arXiv

[40] [42]

Y . Ma, Y . Zhou, Y . Yang, T. Wang, and H. Fan. Running vlas at real-time speed.arXiv preprint arXiv:2510.26742, 2025

arXiv 2025

[41] [43]

Z. Wu, J. Li, P. Xu, and C. K. Liu. Human-object interaction from human-level instructions. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11176– 11186, 2025

2025

[42] [44]

B. Yi, V . Ye, M. Zheng, Y . Li, L. M ¨uller, G. Pavlakos, Y . Ma, J. Malik, and A. Kanazawa. Estimating body and hand motion in an ego-sensed world. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[43] [45]

walk toward the chair

M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Mu ˜noz, X. Yao, R. Z ¨urbr¨ugg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y . Feng, A. Garg, R. Gasoto, L. Gulich, Y . Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V . Makoviychuk, G. Malczyk, ...

Pith/arXiv arXiv 2025