
arxiv: 2606.27581 · v1 · pith:ZK52CHAAnew · submitted 2026-06-25 · 💻 cs.RO · cs.AI

SceneBot: Contact-Prompted General Humanoid Whole Body Tracking with Scene-Interaction

Pith reviewed 2026-06-29 01:17 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords humanoid control · reinforcement learning · whole-body tracking · contact-rich tasks · motion retargeting · scene interaction · policy conditioning

The pith

SceneBot unifies free-space and contact-rich humanoid tracking by conditioning one policy on reference motions plus per-link contact labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build a single reinforcement-learning policy that lets humanoid robots perform both free-space locomotion and contact-rich tasks such as object manipulation or terrain traversal within one framework. It adds explicit per-link contact labels to the policy input so the controller knows which body parts should touch the environment at each moment. Because labeled interaction data does not exist, the authors reconstruct scenes after the fact from ordinary human motion captures and infer the required contact labels. A sympathetic reader would care because separate policies for different behaviors have so far blocked long-horizon tasks that mix walking, climbing, and carrying. If the approach holds, a humanoid could switch between free motion and contact behaviors without changing controllers or retraining.

Core claim

SceneBot trains one policy on reference motions and per-link contact labels that are obtained by hindsight scene reconstruction from retargeted human motion. The resulting policy handles free-space locomotion, uneven terrain, and whole-body manipulation, generalizes to motions and scenes outside the training set, and completes long-horizon tasks such as carrying a box upstairs. The work therefore presents contact conditioning as a practical interface that resolves physical ambiguities that pure kinematic tracking cannot address.

What carries the argument

Per-link contact conditioning, which supplies explicit expected-interaction labels to a single policy so that it can resolve contact ambiguities across locomotion and manipulation.
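
To make the idea concrete, here is a minimal sketch of what "conditioning on per-link contact labels" could look like at the input level. It is an illustration only: the link names, the vector layout, and the function `build_policy_input` are assumptions made for this example, not the paper's actual observation space, which this page does not describe.

```python
from typing import List, Sequence

# Hypothetical link names for illustration; the paper's link set is not given here.
CONTACT_LINKS = ["left_foot", "right_foot", "left_hand", "right_hand", "pelvis"]


def build_policy_input(
    proprioception: Sequence[float],
    reference_motion: Sequence[float],
    contact_labels: Sequence[int],
) -> List[float]:
    """Concatenate robot state, reference motion, and per-link contact labels."""
    if len(contact_labels) != len(CONTACT_LINKS):
        raise ValueError("expected one contact label per link")
    if any(label not in (0, 1) for label in contact_labels):
        raise ValueError("contact labels must be 0 (no contact) or 1 (contact)")
    return (
        [float(x) for x in proprioception]
        + [float(x) for x in reference_motion]
        + [float(c) for c in contact_labels]
    )


# Same state and reference pose, two different contact prompts:
# one foot on the ground versus both feet and both hands in contact.
state = [0.0, 0.0]
pose = [0.1, -0.2, 0.3]
walking = build_policy_input(state, pose, [1, 0, 0, 0, 0])
carrying = build_policy_input(state, pose, [1, 1, 1, 1, 0])
```

The two inputs share an identical kinematic part and differ only in the contact block, which is the kind of ambiguity the contact prompt is meant to resolve.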

If this is right

  • A single policy executes both free-space and contact-rich behaviors without controller switching.
  • Training on 7.5 hours of reconstructed contact-rich data suffices for generalization to unseen motions and environments.
  • Contact conditioning provides a reusable interface that extends kinematic tracking to scene-interacting tasks.
  • Complex long-horizon sequences such as carrying objects upstairs become feasible within one learned controller.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contact-label interface could be tested on other robot morphologies or multi-robot coordination tasks.
  • If reconstruction errors accumulate on highly deformable objects, the method would require additional sensing or online label correction.
  • Real-robot transfer would need to verify that simulated contact labels remain valid under actuator noise and model mismatch.

Load-bearing premise

The hindsight scene reconstruction step produces sufficiently accurate per-link contact labels from retargeted human motion without introducing systematic errors that would prevent policy generalization.

What would settle it

Run the trained policy on a new scene where the reconstructed contact labels disagree with the actual geometry and physics; if the policy produces unstable or incorrect contacts while a version trained with ground-truth labels succeeds, the reconstruction assumption fails.

Figures

Figures reproduced from arXiv: 2606.27581 by C. Karen Liu, Guanya Shi, Jiaman Li, Shibo Zhao, Sirui Chen, Zhen Wu.

Figure 1: SceneBot enables a single motion tracking policy to accurately achieve free-form locomo…
Figure 2: Scene reconstruction: SceneBot uses retargeted human motion to reconstruct scene assets. It first builds the robot-scene interaction graph, then reconstructs plausible terrains and objects. Training: SceneBot trains a motion and contact tracking policy via reinforcement learning using contact-based rewards. Deployment: SceneBot relies on SuperOdometry [35] and an onboard IMU to estimate root position xroot…
Figure 3: Scene interaction graph for different in…
Figure 4: (a): Our scene reconstruction method can generate complex scenes that match the retar…
Figure 5: Composition of training data. Section 4.2, Tracking Performance for Different Motions: Qualitatively, our method successfully executes behaviors such as stepping onto stairs of varying heights, picking up and carrying boxes, sitting down, and performing agile kicking and running, as demonstrated in the supplementary video. Additionally, our approach manages long-horizon, simultaneous object and terrain interact…
Figure 6: Drift in local tracking causes motion-terrain misalignment. Quantitatively, we evaluate tracking performance using the average global root tracking error, average joint tracking error, and success rates across four task categories in a MuJoCo sim-to-sim environment: free-space, terrain interaction, object interaction, sitting. We compare our method against the state-of-the-art general motion tracking p…
Figure 7: Root position error on the terrain task across different training steps. As illustrated in…
Figure 8: Left: Omni-retarget creates an object-hand mismatch. Right: Tracking performance comparison between scene asset reconstruction and scene-aware retargeting. Qualitatively, our pipeline can reconstruct complicated terrain from Lafan obstacle sequences (…
Figure 9: Left: State estimation compared against mo…
Figure 10: Detailed breakdown of the state estimation results for the task of grasping a box and…
Original abstract

Current humanoid reinforcement-learning policies excel at free-space motions but struggle with contact-rich tasks, as pure kinematic tracking cannot resolve the physical ambiguities of interacting with objects and uneven terrain. To address this, we introduce SceneBot, a unified motion-tracking framework capable of handling free-space locomotion, terrain traversal, and whole-body manipulation. SceneBot conditions a single policy on both reference motions and per-link contact labels, explicitly defining expected environmental interactions. To overcome the lack of annotated interaction data, we propose a hindsight scene reconstruction approach that infers scene-interaction graphs from retargeted human motion. Trained on 7.5 hours of this reconstructed, contact-rich data, SceneBot successfully generalizes to unseen motions and environments. Our results demonstrate that SceneBot is the first general framework to seamlessly unify free-space and contact-rich behaviors, executing complex, long-horizon tasks like carrying a box upstairs, and establish contact conditioning as a powerful interface for humanoid control. All code and data will be open-sourced. More demos and information are available at: https://ericcsr.github.io/scenebot/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SceneBot, a unified RL-based motion-tracking framework for humanoids that conditions a single policy on reference motions plus per-link contact labels. These labels are obtained via a hindsight scene-reconstruction pipeline applied to retargeted human motion; the resulting 7.5-hour dataset is used to train the policy, which is claimed to generalize to unseen motions and environments while seamlessly handling both free-space locomotion and contact-rich whole-body tasks such as carrying a box upstairs.

Significance. If the central claims hold, the work would be significant for humanoid control by demonstrating that explicit contact conditioning can serve as a general interface bridging free-space and interaction behaviors, with the open release of code and data providing a concrete resource for the community.

major comments (2)
  1. [Abstract / Methods (hindsight reconstruction pipeline)] The unification claim and generalization to long-horizon contact-rich tasks rest on the assumption that hindsight scene reconstruction produces sufficiently accurate per-link contact labels. No quantitative validation of label fidelity (e.g., precision/recall of contact timing and location against physics simulation or ground-truth scenes) is reported, so systematic retargeting biases could cause the policy to overfit to reconstruction artifacts rather than true dynamics.
  2. [Abstract / Results] The abstract asserts successful generalization to unseen motions and environments and to complex tasks, yet supplies no quantitative results, ablation studies, tracking-error metrics, or success rates. Without these data the evidence supporting the central claim cannot be evaluated.
minor comments (1)
  1. [Abstract] The manuscript should include a clear description of the policy architecture, observation space, and reward terms in the main text rather than relying solely on the project page.
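
The label-fidelity check requested in major comment 1 can be sketched as follows. This is a hedged illustration, not the paper's evaluation: it assumes per-frame binary contact labels for a single link, treats some external reference (for example, contacts read out of a physics simulation) as ground truth, and uses the hypothetical helper names `contact_precision_recall` and `onset_error_frames`.

```python
from typing import List, Optional, Tuple


def contact_precision_recall(pred: List[int], ref: List[int]) -> Tuple[float, float]:
    """Frame-wise precision and recall of predicted contact labels for one link."""
    if len(pred) != len(ref):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(pred, ref) if p == 0 and r == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def first_onset(labels: List[int]) -> Optional[int]:
    """Index of the first frame where contact switches from 0 to 1, if any."""
    previous = 0
    for index, value in enumerate(labels):
        if value == 1 and previous == 0:
            return index
        previous = value
    return None


def onset_error_frames(pred: List[int], ref: List[int]) -> Optional[int]:
    """Absolute difference, in frames, between the first contact onsets."""
    a, b = first_onset(pred), first_onset(ref)
    if a is None or b is None:
        return None
    return abs(a - b)


# Toy example: the reconstructed label makes and breaks contact one frame late.
reconstructed = [0, 0, 1, 1, 1, 0]
reference = [0, 1, 1, 1, 0, 0]
precision, recall = contact_precision_recall(reconstructed, reference)
timing_error = onset_error_frames(reconstructed, reference)
```

In this toy case two of the three predicted contact frames are correct and two of the three true contact frames are recovered, so precision and recall are both 2/3, and the onset is off by one frame. A real analysis would aggregate such numbers over links and held-out motions.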

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below, indicating planned revisions where the manuscript can be strengthened.

Point-by-point responses
  1. Referee: [Abstract / Methods (hindsight reconstruction pipeline)] The unification claim and generalization to long-horizon contact-rich tasks rest on the assumption that hindsight scene reconstruction produces sufficiently accurate per-link contact labels. No quantitative validation of label fidelity (e.g., precision/recall of contact timing and location against physics simulation or ground-truth scenes) is reported, so systematic retargeting biases could cause the policy to overfit to reconstruction artifacts rather than true dynamics.

    Authors: We agree this is a valid point and that explicit validation of label quality would strengthen the paper. In the revision we will add a dedicated analysis (new subsection in Methods or Experiments) that reports precision, recall, and timing error for contact labels on a held-out set of motions, obtained by comparing the reconstructed labels against forward simulation in the target scenes. This will directly address potential retargeting biases. revision: yes

  2. Referee: [Abstract / Results] The abstract asserts successful generalization to unseen motions and environments and to complex tasks, yet supplies no quantitative results, ablation studies, tracking-error metrics, or success rates. Without these data the evidence supporting the central claim cannot be evaluated.

    Authors: The full results section already presents quantitative tracking-error curves, per-task success rates (including the box-carrying example), and ablations on contact conditioning versus baselines. To make the strength of evidence immediately visible, we will revise the abstract to include concise numerical highlights drawn from those results (e.g., mean tracking error and success rate ranges). revision: partial
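
The "numerical highlights" mentioned in the second response would presumably draw on the metrics named in the Figure 6 caption: average global root tracking error, average joint tracking error, and success rate. The sketch below shows one plausible way to compute them. The exact definitions used in the paper are not given on this page, so the Euclidean root distance, the mean absolute joint difference, and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def mean_root_error(actual: Sequence[Vec3], reference: Sequence[Vec3]) -> float:
    """Mean Euclidean distance between actual and reference root positions."""
    if len(actual) != len(reference) or not actual:
        raise ValueError("need two non-empty trajectories of equal length")
    return sum(math.dist(a, r) for a, r in zip(actual, reference)) / len(actual)


def mean_joint_error(
    actual: Sequence[Sequence[float]], reference: Sequence[Sequence[float]]
) -> float:
    """Mean absolute joint-angle difference over all frames and joints."""
    total, count = 0.0, 0
    for frame_actual, frame_reference in zip(actual, reference):
        for a, r in zip(frame_actual, frame_reference):
            total += abs(a - r)
            count += 1
    if count == 0:
        raise ValueError("no joint samples to compare")
    return total / count


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of episodes marked successful."""
    if not outcomes:
        raise ValueError("no episodes")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Toy two-frame rollout: the root drifts 2 units in z on the second frame.
root_err = mean_root_error([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)])
joint_err = mean_joint_error([[0.0, 0.5]], [[0.1, 0.5]])
rate = success_rate([True, True, False, True])
```

Reporting these per task category (free-space, terrain interaction, object interaction, sitting) would mirror the breakdown the caption describes.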

Circularity Check

0 steps flagged

No circularity: derivation is self-contained with external data pipeline

full rationale

The paper presents a policy trained on contact labels generated via an external hindsight reconstruction process from retargeted human motion data. No equations, fitted parameters, or predictions are shown that reduce to the inputs by construction. The generalization claim is to unseen motions and environments, which are independent of the training set. No self-citation chains or uniqueness theorems are invoked in the provided text to support the central result. The method is a standard reinforcement-learning pipeline conditioned on reconstructed labels, with no load-bearing step that equates the output to the input definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The hindsight reconstruction step implicitly assumes accurate contact inference from mocap retargeting, but this is not formalized.

pith-pipeline@v0.9.1-grok · 5733 in / 1104 out tokens · 28928 ms · 2026-06-29T01:17:53.705707+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes

    cs.RO 2026-06 unverdicted novelty 6.0

    Generates 48,000 synthetic VLK trajectories in 3D-reconstructed scenes to train a policy for egocentric perception-based humanoid navigation and object transport, shown on physical Unitree G1 robot.

Reference graph

Works this paper leans on

37 extracted references · 2 canonical work pages · cited by 1 Pith paper

  1. [1] Q. Liao, T. E. Truong, X. Huang, Y. Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion. arXiv preprint arXiv:2508.08241, 2025
  2. [2] X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG), 37(4):1–14, 2018
  3. [3] T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143, 2025
  4. [4] Z. Luo, Y. Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025
  5. [5] Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. Gmt: General motion tracking for humanoid whole-body control. arXiv preprint arXiv:2506.14770, 2025
  6. [6] Z. Zhang, J. Guo, C. Chen, J. Wang, C. Lin, Y. Lian, H. Xue, Z. Wang, M. Liu, J. Lyu, et al. Track any motions under any disturbances. arXiv preprint arXiv:2509.13833, 2025
  7. [7] Y. Ze, Z. Chen, J. P. Araújo, Z.-a. Cao, X. B. Peng, J. Wu, and C. K. Liu. Twist: Teleoperated whole-body imitation system. arXiv preprint arXiv:2505.02833, 2025
  8. [8] Q. Lu, Y. Feng, B. Shi, M. Piseno, Z. Bao, and C. K. Liu. Gentlehumanoid: Learning upper-body compliance for contact-rich human and object interaction. arXiv preprint arXiv:2511.04679, 2025
  9. [9] S. Chen, Z.-a. Cao, Z. Luo, F. Castañeda, C. Li, T. Wang, Y. Yuan, L. Fan, C. K. Liu, Y. Zhu, et al. Chip: Adaptive compliance for humanoid control through hindsight perturbation. arXiv preprint arXiv:2512.14689, 2025
  10. [10] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019
  11. [11] J. Saito, J. Li, M. de Ruyter, M. Guerrero, E. Lim, E. Hassani, R. B. Ribera, H. Moon, M. Dadela, M. D. Lucca, Q. Wang, X. Li, J. Kautz, S. Yuen, and U. Iqbal. Soma: Unifying parametric human body models. arXiv preprint arXiv:2603.16858, 2026. URL https://arxiv.org/abs/2603.16858
  12. [12] J. P. Araujo, Y. Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. arXiv preprint arXiv:2510.02252, 2025
  13. [13] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Munoz, X. Yao, R. Zurbrügg, N. Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025
  14. [14] K. Zakka, Q. Liao, B. Yi, L. L. Lay, K. Sreenath, and P. Abbeel. mjlab: A lightweight framework for gpu-accelerated robot learning. arXiv preprint arXiv:2601.22074, 2026
  15. [15] Y. Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system. arXiv preprint arXiv:2511.02832, 2025
  16. [16] R. Deits and R. Tedrake. Footstep planning on uneven terrain with mixed-integer convex optimization. In 2014 IEEE-RAS international conference on humanoid robots, pages 279–
  17. [17] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake. Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot. Autonomous robots, 40(3):429–455, 2016
  18. [18] Q. Ben, B. Xu, K. Li, F. Jia, W. Zhang, J. Wang, J. Wang, D. Lin, and J. Pang. Gallant: Voxel grid-based humanoid locomotion and local-navigation across 3d constrained terrains. arXiv preprint arXiv:2511.14625, 2025
  19. [19] Y. Zhang, Y. Seo, J. Chen, Y. Yuan, K. Sreenath, P. Abbeel, C. Sferrazza, K. Liu, R. Duan, and G. Shi. Rpl: Learning robust humanoid perceptive locomotion on challenging terrains. arXiv preprint arXiv:2602.03002, 2026
  20. [20] C. Zhang, V. Klemm, F. Yang, and M. Hutter. Ame-2: Agile and generalized legged locomotion via attention-based neural map encoding. arXiv preprint arXiv:2601.08485, 2026
  21. [21] L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction. arXiv preprint arXiv:2509.26633, 2025
  22. [22] Z. Wu, X. Huang, L. Yang, Y. Zhang, K. Sreenath, X. Chen, P. Abbeel, R. Duan, A. Kanazawa, C. Sferrazza, et al. Perceptive humanoid parkour: Chaining dynamic human skills via motion matching. arXiv preprint arXiv:2602.15827, 2026
  23. [23] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 40(4):1–20, 2021
  24. [24] S. Zhu, Z. Zhuang, M. Zhao, K.-Y. Lee, and H. Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids. arXiv preprint arXiv:2601.07718, 2026
  25. [25] M. Xu, Y. Shi, K. Yin, and X. B. Peng. Parc: Physics-based augmentation with reinforcement learning for character controllers. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025
  26. [26] Z. Zhang, K. Wen, M. Xu, J. He, C. Li, T. Miki, C. Schwarke, C. Zhang, X. B. Peng, and M. Hutter. Learning whole-body humanoid locomotion via motion generation and motion tracking. arXiv preprint arXiv:2604.17335, 2026
  27. [27] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025
  28. [28] S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al. ψ0: An open foundation model towards universal humanoid loco-manipulation. arXiv preprint arXiv:2603.12263, 2026
  29. [29] J. Li, J. Wu, and C. K. Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023
  30. [30] J. Lu, C.-H. P. Huang, U. Bhattacharya, Q. Huang, and Y. Zhou. Humoto: A 4d dataset of mocap human object interactions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10886–10897, 2025
  31. [31] H. Weng, Y. Li, N. Sobanbabu, Z. Wang, Z. Luo, T. He, D. Ramanan, and G. Shi. Hdmi: Learning interactive humanoid whole-body control from human videos. arXiv preprint arXiv:2509.16757, 2025
  32. [32] Z. Wu, J. Li, P. Xu, and C. K. Liu. Human-object interaction from human-level instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11176–11186, 2025
  33. [33] S. Zhao, Y. Ze, Y. Wang, C. K. Liu, P. Abbeel, G. Shi, and R. Duan. Resmimic: From general motion tracking to humanoid whole-body loco-manipulation via residual learning. arXiv preprint arXiv:2510.05070, 2025
  34. [34] C. Zhang, W. Xiao, T. He, and G. Shi. Wococo: Learning whole-body humanoid control with sequential contacts. arXiv preprint arXiv:2406.06005, 2024
  35. [35] S. Zhao, H. Zhang, P. Wang, L. Nogueira, and S. Scherer. Super odometry: Imu-centric lidar-visual-inertial estimator for challenging environments. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8729–8736. IEEE, 2021
  36. [36] Y. Jiang, Y. Ye, D. Gopinath, J. Won, A. W. Winkler, and C. K. Liu. Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain generation. In SIGGRAPH Asia 2022 Conference Papers, SA ’22, page 1–9. ACM, Nov. 2022. doi:10.1145/3550469.3555428. URL http://dx.doi.org/10.1145/3550469.3555428
  37. [37] F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and C. Pal. Robust motion in-betweening. ACM Transactions on Graphics, 39(4), Aug. 2020. ISSN 1557-7368. doi:10.1145/3386569.3392480. URL http://dx.doi.org/10.1145/3386569.3392480

A Scene Reconstruction Algorithm

Algorithm 1: Scene Reconstruction from Human Motion
Require: Human kinematic motion M_human
Ensure: R...