
arxiv: 2605.15517 · v1 · pith:53FO7AAMnew · submitted 2026-05-15 · 💻 cs.RO · cs.SY · eess.SY

Terrain Consistent Reference-Guided RL for Humanoid Navigation Autonomy

Pith reviewed 2026-05-19 15:18 UTC · model grok-4.3

classification 💻 cs.RO · cs.SY · eess.SY
keywords humanoid locomotion · reinforcement learning · reference-guided policy · terrain adaptation · SE(2) navigation · autonomous navigation · Unitree G1

The pith

Modulating reference trajectories to fit terrain geometry inside RL training produces humanoid policies that track SE(2) velocity commands reliably on rough outdoor ground and stairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train reference-guided reinforcement learning policies for humanoid locomotion by adjusting the reference trajectories during training so they remain consistent with the terrain. Desired footsteps are projected onto valid footholds while swing-foot and center-of-mass paths are shifted to match the local geometry, all inside the simulation loop. The resulting policy presents a simple SE(2) velocity interface that standard navigation planners can use directly. This setup improves reference tracking in simulation and supports closed-loop hardware runs longer than 70 meters on the Unitree G1 through mixed rough terrain and stair flights using only onboard sensing and computation.
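To make the SE(2) velocity interface concrete, here is a minimal sketch, not taken from the paper, of how a body-frame command (forward velocity, lateral velocity, yaw rate) advances a planar pose. The names `SE2Pose` and `integrate_se2` and the first-order Euler step are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class SE2Pose:
    """Planar pose: position (m) and heading (rad) in the world frame."""
    x: float
    y: float
    yaw: float


def integrate_se2(pose, vx, vy, wz, dt):
    """Advance a pose by a body-frame velocity command over dt (Euler step).

    vx, vy are body-frame linear velocities (m/s); wz is the yaw rate (rad/s).
    """
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    # Rotate the body-frame velocity into the world frame before integrating.
    new_x = pose.x + (c * vx - s * vy) * dt
    new_y = pose.y + (s * vx + c * vy) * dt
    # Wrap the heading into [-pi, pi).
    new_yaw = (pose.yaw + wz * dt + math.pi) % (2.0 * math.pi) - math.pi
    return SE2Pose(new_x, new_y, new_yaw)
```

A planner that only emits `(vx, vy, wz)` can treat the locomotion policy as a system of this form, which is what makes the interface easy to plug into existing navigation stacks.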

Core claim

Synthesizing SE(2)-controllable reference trajectories inside the RL training loop and modulating them by projecting footsteps onto valid footholds and reshaping swing-foot and center-of-mass trajectories to the terrain produces a perceptive locomotion policy that maintains stable tracking while exposing a clean velocity command interface for integration with MPC and control barrier function planners.

What carries the argument

terrain-consistent reference modulation: the process of projecting desired footsteps onto valid footholds and adjusting swing-foot and center-of-mass trajectories to match terrain geometry inside the RL training loop
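The modulation step can be sketched as follows, under loudly stated assumptions: the terrain is a small grid height map with a per-cell validity mask, projection means snapping to the nearest valid cell center, and the swing-foot and center-of-mass heights are simple blends between lift-off and touchdown heights. None of these function names or formulas come from the paper; they only illustrate the idea of making a reference agree with local geometry.

```python
def project_footstep(desired_xy, height_map, valid, cell_size):
    """Snap a desired (x, y) footstep to the nearest valid cell center.

    Hypothetical layout: cell (i, j) is centered at (i * cell_size, j * cell_size),
    height_map[i][j] is its height (m), valid[i][j] marks steppable cells.
    Returns (x, y, z) of the chosen foothold.
    """
    best = None
    for i, row in enumerate(valid):
        for j, ok in enumerate(row):
            if not ok:
                continue
            cx, cy = i * cell_size, j * cell_size
            d2 = (cx - desired_xy[0]) ** 2 + (cy - desired_xy[1]) ** 2
            if best is None or d2 < best[0]:
                best = (d2, (cx, cy, height_map[i][j]))
    if best is None:
        raise ValueError("no valid foothold in the map")
    return best[1]


def swing_height(s, z_start, z_end, clearance):
    """Swing-foot height at phase s in [0, 1].

    Linear blend from lift-off height to touchdown height, plus a parabolic
    clearance bump that vanishes at both ends (an illustrative choice).
    """
    s = min(max(s, 0.0), 1.0)
    return (1.0 - s) * z_start + s * z_end + clearance * 4.0 * s * (1.0 - s)


def com_height(s, z_start, z_end, nominal):
    """Center-of-mass height reference: a nominal offset above the blended foot height."""
    s = min(max(s, 0.0), 1.0)
    return nominal + (1.0 - s) * z_start + s * z_end
```

In a training loop, functions like these would run on every step of every simulated environment, so a real implementation would likely be vectorized; the loop form here is kept for readability.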

If this is right

  • Environmentally-conditioned references produce significantly better tracking performance than environment-agnostic references in simulation.
  • The trained policy integrates directly with an MPC plus control barrier function planner to achieve closed-loop autonomous navigation.
  • Long-horizon runs exceeding 70 meters are demonstrated on the Unitree G1 through outdoor rough terrain and consecutive stair flights with all sensing and computation performed onboard.
  • The policy exposes a standard SE(2) velocity interface that is compatible with existing navigation autonomy infrastructure.
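As a rough illustration of how a control barrier function can sit between a planner and a velocity interface, the sketch below filters a desired planar velocity so that a point robot keeps clear of one circular obstacle. It assumes single-integrator dynamics, a continuous-time barrier h(p) = ||p - c||^2 - r^2, and a world-frame velocity; the paper's MPC + CBF formulation is not reproduced here, and `cbf_filter` with its parameters is a hypothetical name.

```python
def cbf_filter(pos, v_des, obstacle_center, obstacle_radius, alpha=1.0):
    """Minimally modify v_des so that dh/dt + alpha * h >= 0 holds.

    For a single constraint a . v >= b, the closest velocity to v_des
    (in the Euclidean norm) has the closed form used below.
    """
    dx = pos[0] - obstacle_center[0]
    dy = pos[1] - obstacle_center[1]
    h = dx * dx + dy * dy - obstacle_radius ** 2  # barrier value, >= 0 when safe
    a = (2.0 * dx, 2.0 * dy)                      # gradient of h with respect to pos
    b = -alpha * h
    slack = a[0] * v_des[0] + a[1] * v_des[1] - b
    if slack >= 0.0:
        return v_des                              # desired command is already safe
    norm2 = a[0] ** 2 + a[1] ** 2
    if norm2 == 0.0:
        return (0.0, 0.0)                         # degenerate case: at the obstacle center
    scale = -slack / norm2
    # Project v_des onto the boundary of the safe half-space.
    return (v_des[0] + scale * a[0], v_des[1] + scale * a[1])
```

The yaw-rate component of an SE(2) command and the rotation from world frame to body frame are omitted; a full pipeline would apply both before sending the command to the policy.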

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modulation step could be applied to other reference-based controllers beyond reinforcement learning to improve terrain robustness.
  • Replacing the terrain projection module with online perception would allow the policy to handle previously unseen terrain without retraining.
  • The clean velocity interface may simplify the design of higher-level planners that treat the locomotion layer as a black-box controllable system.

Load-bearing premise

Projecting footsteps and reshaping trajectories to match terrain inside the training loop yields stable, generalizable signals that transfer to real hardware without introducing instability or poor tracking on unseen terrain shapes.

What would settle it

A controlled test in which the policy is run on terrain geometries absent from the training modulation step and shows either loss of balance or substantially worse reference tracking than the environment-agnostic baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.15517 by Aaron D. Ames, William D. Compton, Zachary Olkin.

Figure 1: LIP-CLF RL trains an unstructured terrain locomotion policy using ... (caption truncated in source)
Figure 2: System Architecture Diagram. References are generated in the CLF-RL training loop, and modified to be consistent with the environment. When ... (caption truncated in source)
Figure 3: Modification of the LIP reference for terrain consistency. (a) The ... (caption truncated in source)
Figure 4: Outdoor autonomous navigation. The robot is localized within an existing map, while MPC plans trajectories through the environment over rough ... (caption truncated in source)
Original abstract

We present a method for training reference-guided, perceptive reinforcement learning locomotion policies for humanoid robots in which reference trajectories are modulated in training to be consistent with terrain geometry. Aiming to deploy our method with standard navigation autonomy infrastructure, we synthesize SE(2)-controllable reference trajectories inside the RL training loop, projecting desired footsteps onto valid footholds and adjusting swing-foot and center-of-mass trajectories to match the terrain. The resulting policy exposes a clean SE(2) velocity interface compatible with standard navigation planners. In simulation, environmentally-conditioned references significantly improve reference tracking performance compared to environment agnostic references. On hardware, we integrate the policy with an MPC + control barrier function planner and demonstrate long-horizon (>70m) closed-loop autonomous navigation on the Unitree G1 through outdoor environments containing rough terrain and consecutive flights of stairs, with all sensing and computation onboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper presents a method for training reference-guided, perceptive RL locomotion policies for humanoid robots. Reference trajectories are modulated inside the RL training loop to ensure consistency with terrain geometry by projecting desired footsteps onto valid footholds and adjusting swing-foot and center-of-mass trajectories accordingly. This yields a policy exposing a clean SE(2) velocity interface. In simulation, environmentally-conditioned references improve reference tracking over environment-agnostic ones. On hardware, the policy is integrated with an MPC + control barrier function planner to demonstrate long-horizon (>70 m) closed-loop autonomous navigation on the Unitree G1 through outdoor rough terrain and consecutive stairs, using only onboard sensing and computation.

Significance. If the empirical results hold, the work offers a practical bridge between RL-based locomotion and standard navigation planners for humanoids, addressing terrain-induced reference mismatches that often hinder sim-to-real transfer. The terrain-consistent modulation inside the training loop and the extended hardware demonstration with MPC+CBF integration are notable strengths, providing a clean velocity interface without requiring perfect terrain knowledge at deployment time.

major comments (1)
  1. [Results] Results section: The claim that environmentally-conditioned references 'significantly improve reference tracking performance' is central to the contribution, yet the manuscript provides no quantitative metrics (e.g., tracking error, success rate), baseline comparisons, error bars, or statistical tests; this weakens the ability to assess the magnitude and reliability of the reported improvement.
minor comments (3)
  1. [Abstract] Abstract and §3: The description of the footstep projection and trajectory adjustment procedure would benefit from an explicit equation or pseudocode block showing how the modulated reference is computed from the SE(2) command and terrain map.
  2. [Hardware Experiments] Hardware experiments: Details on RL hyperparameters, training data exclusion criteria, and any domain randomization used during training are missing; adding these would improve reproducibility.
  3. [Figures] Figure captions: Several figures lack axis labels or units, and the terrain height map visualization in the hardware trials could be clarified with scale bars.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the single major comment below and will strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: [Results] Results section: The claim that environmentally-conditioned references 'significantly improve reference tracking performance' is central to the contribution, yet the manuscript provides no quantitative metrics (e.g., tracking error, success rate), baseline comparisons, error bars, or statistical tests; this weakens the ability to assess the magnitude and reliability of the reported improvement.

    Authors: We agree that explicit quantitative support is required for the central claim. In the revised manuscript we will add mean and standard-deviation tracking errors for CoM and foot positions, success rates over repeated simulation trials, direct numerical comparison against the environment-agnostic baseline, error bars, and appropriate statistical tests (e.g., paired t-tests) to demonstrate significance and reliability of the observed improvement. revision: yes
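The kind of paired comparison promised in this response could be computed along the following lines. This is a sketch using only the Python standard library; `paired_summary` is a hypothetical helper, the inputs are assumed to be per-trial tracking errors from the two reference types on matched terrains and seeds, and no values from the paper are used.

```python
import math
import statistics


def paired_summary(errors_baseline, errors_method):
    """Summarize paired per-trial tracking errors (same trials, two methods).

    Returns the mean and sample standard deviation of each method, the mean
    paired difference (baseline minus method), the paired t statistic, and
    its degrees of freedom. Converting t to a p-value needs a t-distribution
    CDF, which is left to a statistics package.
    """
    if len(errors_baseline) != len(errors_method) or len(errors_baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 trials")
    diffs = [b - m for b, m in zip(errors_baseline, errors_method)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    if sd_d == 0.0:
        # All differences identical: t is unbounded unless the mean is zero.
        t_stat = 0.0 if mean_d == 0.0 else math.copysign(math.inf, mean_d)
    else:
        t_stat = mean_d / (sd_d / math.sqrt(n))
    return {
        "baseline_mean": statistics.mean(errors_baseline),
        "baseline_std": statistics.stdev(errors_baseline),
        "method_mean": statistics.mean(errors_method),
        "method_std": statistics.stdev(errors_method),
        "mean_diff": mean_d,
        "t_stat": t_stat,
        "dof": n - 1,
    }
```

A positive `mean_diff` with a large `t_stat` would indicate that the baseline's error is consistently higher across the matched trials; pairing matters because terrain difficulty varies from trial to trial.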

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents an empirical RL training procedure that modulates SE(2) reference trajectories inside the loop by projecting footsteps and adjusting swing-foot/CoM paths to terrain geometry, then evaluates the resulting policy via simulation comparisons against environment-agnostic baselines and hardware integration with an MPC+CBF planner. No load-bearing derivation, prediction, or uniqueness claim reduces by the paper's own equations or self-citations to a definitional equivalence or fitted input; the reported improvements and long-horizon navigation results rest on experimental outcomes rather than internal redefinition of quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach depends on standard RL training assumptions plus one key domain assumption about real-time terrain sensing and projection being both feasible and beneficial for policy learning.

axioms (1)
  • domain assumption: Terrain geometry can be accurately sensed and used to project footsteps onto valid footholds while adjusting swing-foot and center-of-mass trajectories in real time during both training and deployment.
    Invoked directly in the synthesis of references inside the RL loop and in the hardware integration with the MPC planner.

pith-pipeline@v0.9.0 · 5680 in / 1558 out tokens · 86997 ms · 2026-05-19T15:18:39.611045+00:00 · methodology

