MARCH: Model-Assisted Reinforcement Learning for the Perceptive Control of Humanoids over Sparse Footholds

Codrin Crismariu; Ryan K. Cosner

arXiv: 2606.10288 · v1 · pith:CKGVO3FV · submitted 2026-06-09 · 💻 cs.RO


Pith reviewed 2026-06-27 13:26 UTC · model grok-4.3

classification 💻 cs.RO

keywords: reinforcement learning · humanoid locomotion · perceptive control · sparse footholds · model-assisted RL · control Lyapunov function · policy distillation · bipedal walking


The pith

Model-assisted RL produces safe vision-only humanoid locomotion over sparse footholds by distilling from a privileged teacher guided by simplified-model references.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve perceptive bipedal locomotion on sparse terrain by blending model-based safety with model-free robustness. It generates reference trajectories from simplified models, uses them to shape a control Lyapunov function reward for training a privileged teacher policy, and then distills that policy into a vision-only student. A sympathetic reader would care because the approach claims to cut sample needs, avoid heavy curricula, and deliver smoother motion while matching the stepping performance of pure model-free methods. The work validates this in simulation and on a real Unitree G1 robot.
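As a concrete picture of the first step, here is a minimal sketch of how a simplified model can turn a fixed foothold sequence into a center-of-mass (CoM) reference. It assumes a sagittal-plane linear inverted pendulum (LIP) with constant CoM height, in the spirit of the H-LIP work cited as [25]. The summary does not say which simplified model MARCH actually uses, and the CoM height `z0 = 0.75` m, the step time, and the helper names `lip_step` and `lip_reference` are hypothetical.

```python
import math


def lip_step(x0, v0, t, z0=0.75, g=9.81):
    """Closed-form LIP solution: CoM position and velocity relative to the stance foot
    after time t, starting from (x0, v0). z0 is an assumed constant CoM height [m]."""
    lam = math.sqrt(g / z0)
    x = x0 * math.cosh(lam * t) + (v0 / lam) * math.sinh(lam * t)
    v = x0 * lam * math.sinh(lam * t) + v0 * math.cosh(lam * t)
    return x, v


def lip_reference(footholds, com_x, com_v, step_time=0.4, dt=0.01, z0=0.75):
    """Roll the LIP forward over a given sagittal foothold sequence.

    Returns samples (time, com_x, com_v, stance_foot_x). Stance switches are instantaneous
    and preserve CoM position and velocity, as in the standard LIP step-to-step model."""
    n = int(round(step_time / dt))
    if n < 1:
        raise ValueError("step_time must be at least one sample long")
    samples, t0 = [], 0.0
    for foot in footholds:
        x_rel = com_x - foot  # CoM position relative to the new stance foot
        for k in range(1, n + 1):
            x, v = lip_step(x_rel, com_v, k * dt, z0)
            samples.append((t0 + k * dt, foot + x, v, foot))
        com_x, com_v = foot + x, v  # CoM state when the next foot touches down
        t0 += step_time
    return samples
```

Each step restarts the closed-form LIP solution about the new stance foot, which is what makes such references cheap to generate. A real planner would also choose the footholds themselves (for example from the height map described for Figure 3) rather than take them as given.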

Core claim

The central claim is that the three-step model-assisted procedure—generating safe references from simplified models, training a privileged teacher via CLF rewards around those references, and distilling to a vision student—yields physically grounded locomotion that improves sample efficiency, reduces curriculum complexity, produces smoother behavior, and reaches stepping-stone performance comparable to model-free baselines, with successful real-robot deployment.

What carries the argument

The three-step model-assisted RL framework that builds a CLF reward from safe reference trajectories generated by simplified models, trains a privileged teacher policy, and distills it to a vision-based student.
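The CLF reward is the piece that turns a reference into a training signal. The sketch below shows one common construction, offered as an assumption rather than the paper's exact reward. For a tracking error η = (e, ė) with nominal closed-loop dynamics η̇ = Aη, A = [[0, 1], [-k_p, -k_d]], take V(η) = ηᵀPη with AᵀP + PA = -I. Then reward small V and penalize violations of the exponential decrease condition V̇ ≤ -γV, with V̇ estimated by a finite difference between control steps. The gains, γ, the weights, and all function names are hypothetical; CLF-RL [7] defines its own reward terms.

```python
def lyapunov_matrix_pd(kp, kd):
    """Entries (p11, p12, p22) of the symmetric P solving A^T P + P A = -I
    for A = [[0, 1], [-kp, -kd]]; P is positive definite when kp, kd > 0."""
    if kp <= 0 or kd <= 0:
        raise ValueError("gains must be positive")
    p12 = 1.0 / (2.0 * kp)
    p22 = (1.0 / kp + 1.0) / (2.0 * kd)
    p11 = kd * p12 + kp * p22
    return p11, p12, p22


def clf_value(e, e_dot, P):
    """V(eta) = eta^T P eta for eta = (e, e_dot)."""
    p11, p12, p22 = P
    return p11 * e * e + 2.0 * p12 * e * e_dot + p22 * e_dot * e_dot


def clf_reward(v_prev, v_next, dt, gamma=1.0, w_v=1.0, w_dec=1.0):
    """Reward small V and penalize violations of dV/dt <= -gamma * V,
    with dV/dt estimated by a forward finite difference over one control step."""
    v_dot = (v_next - v_prev) / dt
    violation = max(0.0, v_dot + gamma * v_prev)
    return -w_v * v_next - w_dec * violation
```

For k_p = k_d = 1 the closed form gives P = [[1.5, 0.5], [0.5, 1.0]], so a pure position error of 1 yields V = 1.5. Penalizing the decrease-condition violation, rather than V alone, rewards the policy for converging toward the reference at a prescribed rate instead of merely staying near it.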

If this is right

Training requires fewer samples than pure model-free RL on the same task.
The method avoids the need for elaborate staged curricula to discover precise foot placements.
Locomotion trajectories become smoother while retaining comparable success rates on stepping stones.
The distilled vision policy can be deployed directly on hardware such as the Unitree G1 without additional fine-tuning.
The same reference-generation plus distillation pattern can be applied to other constrained locomotion problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework may extend to terrains with moving footholds if the simplified models are updated accordingly.
Policy distillation from privileged simulation information could reduce the sim-to-real gap for other perception-heavy robot tasks.
The approach implies that hybrid model-RL methods can be tuned primarily through the choice of reference model rather than reward shaping alone.

Load-bearing premise

Simplified models produce reference trajectories that stay safe and useful enough to shape the CLF reward so the teacher's behavior transfers to the vision-only student without large performance loss.

What would settle it

Run the vision-only student policy on the same sparse-foothold courses used for the teacher and model-free baselines; if the student shows markedly higher fall rates or lower success than the baselines, the transfer claim fails.
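The test proposed above comes down to comparing success (or fall) proportions across policies. A minimal sketch, assuming each policy is scored by independent trials on the same courses, is to compute Wilson score intervals and flag the student only when its interval lies entirely below the baseline's. The function names are hypothetical, and non-overlapping intervals are a conservative criterion: overlap does not by itself show that two policies perform equally.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95% coverage)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials))
    return max(0.0, center - half), min(1.0, center + half)


def clearly_worse(student, baseline, z=1.96):
    """True when the student's success interval lies entirely below the baseline's.
    Each argument is a (successes, trials) pair."""
    _, student_hi = wilson_interval(*student, z=z)
    baseline_lo, _ = wilson_interval(*baseline, z=z)
    return student_hi < baseline_lo
```

A two-proportion test, or a bootstrap over courses when trials share terrain, would be reasonable alternatives.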

Figures

Figures reproduced from arXiv: 2606.10288 by Codrin Crismariu, Ryan K. Cosner.

**Figure 2.** Framework Overview. Our framework involves three core components: (red) generation of model-based safe reference trajectories, (blue) training of a privileged teacher control policy that uses control Lyapunov function (CLF)-inspired rewards in a CLF-RL [7] framework, and (green) distillation of the teacher policy into a student policy that only relies on the visual and proprioceptive data accessible d…

**Figure 3.** (Left) mjlab simulation environment with the Unitree G1 humanoid robot. The model-based planner uses the true height map (shown in red and yellow) to create a safe foothold plan represented by the yellow arrows. The camera perspective used to train the student policy is shown in white near the robot's torso. (Right) Results of the ablation study for training the student policy, $\pi_{S,\phi}$. Over the course of 100…

**Figure 4.** Comparison between model-informed (ours) in …

Original abstract

Perceptive bipedal locomotion over sparse terrain remains a difficult challenge: model-based methods are precise but brittle to uncertainty, while model-free methods are robust but struggle to discover the precise, constrained motions required for safety-critical locomotion where small errors can cause catastrophic failures. We propose a model-assisted reinforcement learning (RL) framework that combines both perspectives in three steps: (1) generate a safe reference trajectory using simplified models; (2) train a privileged teacher policy guided by a control Lyapunov function (CLF) reward built around the safe reference trajectory; and (3) distill the teacher into a vision-based student policy. We show that this model-assistance procedure produces physically grounded locomotion, improving sample efficiency, reducing the need for a complex learning curriculum, and achieving smoother locomotion behavior alongside stepping stone performance comparable to model-free baselines. We validate our approach in simulation and demonstrate successful deployment on a Unitree G1 humanoid robot navigating sparse footholds with lateral constraints.
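Step (3) above, distilling a privileged teacher into a perception-only student, is often implemented with a DAgger-style loop: the student's own rollouts decide which states are visited, the teacher labels those states using privileged information, and the student is refit on the aggregated data. The toy sketch below assumes that variant (the abstract does not say which one MARCH uses) and replaces the vision stack with a scalar target offset seen through Gaussian noise. The teacher gain 0.5, the noise level, the halving schedule for β, and every function name are hypothetical.

```python
import random


def teacher_action(state):
    """Hypothetical privileged teacher: sees the true target and closes half the gap."""
    x, target = state
    return 0.5 * (target - x)


def student_obs(state, rng, noise=0.01):
    """Hypothetical 'perception': the student sees the target only through sensor noise."""
    x, target = state
    return x, target + rng.gauss(0.0, noise)


def fit_linear(data):
    """Least-squares fit of a = w0 * x + w1 * target_hat via the 2x2 normal equations."""
    sxx = sxt = stt = sxa = sta = 0.0
    for (x, t), a in data:
        sxx += x * x
        sxt += x * t
        stt += t * t
        sxa += x * a
        sta += t * a
    det = sxx * stt - sxt * sxt
    if abs(det) < 1e-12:
        return 0.0, 0.0
    return (sxa * stt - sxt * sta) / det, (sxx * sta - sxt * sxa) / det


def dagger(iterations=5, rollouts=20, horizon=10, seed=0):
    """DAgger-style distillation: roll out a teacher/student mixture, label every visited
    state with the teacher's action, aggregate the data, and refit the student."""
    rng = random.Random(seed)
    data, w = [], (0.0, 0.0)
    for i in range(iterations):
        beta = 0.5 ** i  # probability of executing the teacher's action this round
        for _ in range(rollouts):
            state = (0.0, rng.uniform(-1.0, 1.0))
            for _ in range(horizon):
                obs = student_obs(state, rng)
                label = teacher_action(state)  # privileged label on the visited state
                data.append((obs, label))
                student_a = w[0] * obs[0] + w[1] * obs[1]
                a = label if rng.random() < beta else student_a
                state = (state[0] + a, state[1])
        w = fit_linear(data)
    return w
```

Because the labels come from the teacher on states the student itself reaches, the student learns to act well on its own state distribution, which plain behavior cloning on teacher-only rollouts does not guarantee. Here the recovered weights approach the teacher's (-0.5, 0.5).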

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MARCH gives a workable three-step pipeline for blending simplified-model references into a CLF teacher then distilling to vision for humanoid sparse-foothold walking, with a real G1 demo, but the abstract supplies no numbers to back the efficiency and smoothness claims.


The paper's core contribution is a concrete pipeline: generate reference trajectories from simplified models, shape a privileged teacher's reward with a CLF around those references, then distill the teacher into a vision-only student. They report that this produces grounded locomotion on stepping stones, cuts curriculum complexity, and runs on the Unitree G1 with lateral constraints.

The real-robot deployment is the clearest positive. Hardware results on sparse footholds matter more than another simulation-only claim, and the problem setup (perceptive bipedal locomotion where small errors are catastrophic) is well chosen.

The integration itself is not a radical departure—model-assisted RL and CLF rewards already exist—but the specific ordering and the claim that it reduces the need for complex curricula are worth checking. The abstract states comparable stepping-stone performance to model-free baselines plus smoother behavior, yet gives no metrics, ablations, or error breakdowns. That absence makes it hard to judge how much the model assistance actually helps.

The central assumption is that the simplified-model trajectories remain safe and useful once perception noise and model mismatch enter the picture, and that distillation preserves the safety properties. The stress-test note flags exactly this point, and the abstract does not resolve it with data.

This is work for the legged-robotics community that already follows hybrid model-RL methods. A reader looking for practical ideas on perception-based foothold selection will find something to try. It is coherent on its own terms and includes hardware evidence, so it clears the bar for serious refereeing even if the quantitative support needs tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MARCH, a three-step model-assisted RL framework for perceptive bipedal locomotion over sparse footholds. It generates safe reference trajectories from simplified models, trains a privileged teacher policy using a control Lyapunov function (CLF) reward constructed around those references, and distills the teacher into a vision-only student policy. The authors claim the procedure yields physically grounded locomotion with improved sample efficiency, reduced need for complex curricula, smoother behavior, stepping-stone performance comparable to model-free baselines, and successful real-robot deployment on a Unitree G1 humanoid.

Significance. If the quantitative claims hold, the work offers a practical hybrid route between brittle model-based planning and sample-inefficient model-free RL for safety-critical locomotion. Demonstrating that simplified-model references can safely shape a CLF reward and that the resulting teacher transfers to a vision student without catastrophic degradation would be a useful contribution to humanoid control under perceptual constraints.

major comments (2)

[Abstract] Abstract: the central claims of 'improving sample efficiency,' 'reducing the need for a complex learning curriculum,' and 'achieving smoother locomotion behavior' are asserted without any numerical metrics, training curves, ablation results, or statistical comparisons. These assertions are load-bearing for the paper's contribution and must be supported by concrete data (e.g., sample counts to reach a success threshold, curriculum stage counts, or smoothness metrics such as jerk or torque variance) in the results section.
[Abstract, §3] Abstract and §3 (method): the claim that simplified-model references remain 'safe and useful' when used to construct the CLF reward is stated without reported analysis of model mismatch, reference feasibility under uncertainty, or failure cases where the reference leads the teacher into unsafe states. A quantitative assessment of reference quality (e.g., tracking error or safety violation rate) is required to substantiate the three-step pipeline.
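The metrics requested in both major comments are straightforward to compute from logged rollouts. The sketch below is one way to do it, under stated assumptions: a training curve logged as (environment steps, success rate) pairs, joint positions sampled at a fixed period `dt` (jerk via third-order finite differences), and stepping stones modeled as axis-aligned rectangles in the ground plane, with a foothold counted as a safety violation when it lands on no stone (optionally shrunk by a margin). All names and the rectangle model are hypothetical, not taken from the paper.

```python
import math
import statistics
from dataclasses import dataclass


def steps_to_threshold(curve, threshold):
    """First logged env-step count whose success rate reaches threshold, or None."""
    for env_steps, success_rate in curve:
        if success_rate >= threshold:
            return env_steps
    return None


def rms_jerk(positions, dt):
    """RMS jerk of a uniformly sampled position signal, via third-order finite differences."""
    if len(positions) < 4:
        raise ValueError("need at least four samples")
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i]) / dt**3
        for i in range(len(positions) - 3)
    ]
    return math.sqrt(sum(j * j for j in jerks) / len(jerks))


def torque_variance(torques):
    """Population variance of a logged torque signal."""
    return statistics.pvariance(torques)


@dataclass(frozen=True)
class Stone:
    """Axis-aligned rectangular stepping stone (hypothetical terrain model)."""
    cx: float
    cy: float
    half_length: float
    half_width: float

    def contains(self, x, y, margin=0.0):
        return (abs(x - self.cx) <= self.half_length - margin
                and abs(y - self.cy) <= self.half_width - margin)


def tracking_rmse(reference, actual):
    """RMS planar distance between reference and realized footholds."""
    if not reference or len(reference) != len(actual):
        raise ValueError("need equally long, non-empty sequences")
    sq = [(rx - ax) ** 2 + (ry - ay) ** 2 for (rx, ry), (ax, ay) in zip(reference, actual)]
    return math.sqrt(sum(sq) / len(sq))


def safety_violation_rate(footholds, stones, margin=0.0):
    """Fraction of footholds that land on no stone."""
    if not footholds:
        raise ValueError("no footholds")
    misses = sum(1 for x, y in footholds if not any(s.contains(x, y, margin) for s in stones))
    return misses / len(footholds)
```

Reporting `steps_to_threshold` for each method at a common success threshold turns "improved sample efficiency" into a single comparable number, and evaluating `safety_violation_rate` on the reference footholds themselves (not only on the executed ones) separates reference quality from tracking quality.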

minor comments (1)

[Abstract] The abstract mentions 'lateral constraints' on the footholds but does not define how these constraints are encoded in either the simplified model or the CLF reward; this notation should be clarified in the method section.
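One possible encoding of a lateral foothold constraint, given only as a hypothetical illustration of what the requested clarification could look like, is a barrier-style function h(y) = w - |y - y_c|, where y_c is the corridor (or stone) centerline and w its lateral half-width, so that h ≥ 0 exactly when the foothold is laterally admissible. This mirrors the control-barrier-function formulations in [1], [10], [11]. The same h could clamp a planned foothold in the simplified-model planner and shape a penalty for the learned policy. The function names, the margin, and the linear penalty form are assumptions, not the paper's definitions.

```python
def lateral_barrier(y, y_center, half_width):
    """h(y) = half_width - |y - y_center|; h >= 0 iff the foothold is laterally admissible."""
    return half_width - abs(y - y_center)


def project_lateral(y, y_center, half_width, margin=0.0):
    """Clamp a planned lateral foothold into the corridor shrunk by `margin`."""
    if margin > half_width:
        raise ValueError("margin exceeds corridor half-width")
    lo = y_center - (half_width - margin)
    hi = y_center + (half_width - margin)
    return min(max(y, lo), hi)


def lateral_penalty(y, y_center, half_width, weight=1.0):
    """Zero inside the corridor, linear in the violation depth outside it."""
    return -weight * max(0.0, -lateral_barrier(y, y_center, half_width))
```

Using one function for both the planner-side projection and the reward-side penalty keeps the model-based and learned components consistent about what counts as a lateral violation.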

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to provide the requested quantitative support for our claims.


Referee: [Abstract] Abstract: the central claims of 'improving sample efficiency,' 'reducing the need for a complex learning curriculum,' and 'achieving smoother locomotion behavior' are asserted without any numerical metrics, training curves, ablation results, or statistical comparisons. These assertions are load-bearing for the paper's contribution and must be supported by concrete data (e.g., sample counts to reach a success threshold, curriculum stage counts, or smoothness metrics such as jerk or torque variance) in the results section.

Authors: The results section includes training curves and comparisons that support these claims, but we agree that the abstract lacks specific numbers. We will update the abstract to include concrete metrics from our experiments, such as the number of environment steps to reach a success threshold and smoothness metrics like torque variance, along with statistical comparisons to baselines. revision: yes
Referee: [Abstract, §3] Abstract and §3 (method): the claim that simplified-model references remain 'safe and useful' when used to construct the CLF reward is stated without reported analysis of model mismatch, reference feasibility under uncertainty, or failure cases where the reference leads the teacher into unsafe states. A quantitative assessment of reference quality (e.g., tracking error or safety violation rate) is required to substantiate the three-step pipeline.

Authors: We acknowledge that a detailed quantitative assessment of the reference quality is not present in the current manuscript. We will add this analysis, reporting tracking errors, feasibility under uncertainty, and safety violation rates, to support the safety and usefulness of the simplified-model references in the CLF reward construction. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description outline a standard three-step teacher-student distillation pipeline (simplified-model reference generation, privileged teacher training with CLF reward, vision-student distillation) without any equations, fitted parameters, or self-citations that reduce claimed performance metrics or derivations to their own inputs by construction. No load-bearing step equates a prediction to a fitted input or imports uniqueness via self-citation chains. The central claims rest on empirical validation in simulation and hardware rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, hyperparameters, or modeling assumptions that can be audited; therefore the ledger remains empty.

pith-pipeline@v0.9.1-grok · 5701 in / 1049 out tokens · 18881 ms · 2026-06-27T13:26:46.800043+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 7 canonical work pages

[1] Q. Nguyen, A. Hereid, J. W. Grizzle, A. D. Ames, and K. Sreenath. 3D dynamic walking on stepping stones with control barrier functions. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 827–834. IEEE, 2016.

[2] N. Csomay-Shanklin, R. K. Cosner, M. Dai, A. J. Taylor, and A. D. Ames. Episodic learning for safe bipedal locomotion with control barrier functions and projection-to-state safety. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144 of Proceedings of Machine Learning Research, pages 1041–1053. PMLR, 07–08 June 2021. URL...

[3] R. Grandia, A. J. Taylor, A. D. Ames, and M. Hutter. Multi-layered safety for legged robots via control barrier functions and model predictive control. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 8352–8358. IEEE, 2021.

[4] T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62), 2022.

[5] Q. Ben, B. Xu, K. Li, F. Jia, W. Zhang, J. Wang, J. Wang, D. Lin, and J. Pang. Gallant: Voxel grid-based humanoid locomotion and local-navigation across 3D constrained terrains, 2025. URL https://arxiv.org/abs/2511.14625

[6] Y. Zhang, Y. Seo, J. Chen, Y. Yuan, K. Sreenath, P. Abbeel, C. Sferrazza, K. Liu, R. Duan, and G. Shi. RPL: Learning robust humanoid perceptive locomotion on challenging terrains. arXiv preprint arXiv:2602.03002, 2026.

[7] K. Li, Z. Olkin, Y. Yue, and A. D. Ames. CLF-RL: Control Lyapunov function guided reinforcement learning. IEEE Robotics and Automation Letters, 11(3):3230–3237, 2026. doi:10.1109/LRA.2026.3653329

[8] M. Dai, W. D. Compton, J. Li, L. Yang, and A. D. Ames. Walk the planc: Physics-guided RL for agile humanoid locomotion on constrained footholds. arXiv preprint arXiv:2601.06286, 2026.

[9] C. M. Bishop. Mixture density networks, 1994. URL https://publications.aston.ac.uk/id/eprint/373/

[10] Q. Nguyen and K. Sreenath. Safety-critical control for dynamical bipedal walking with precise footstep placement. IFAC-PapersOnLine, 48(27):147–154, 2015. ISSN 2405-8963. doi:10.1016/j.ifacol.2015.11.167. Analysis and Design of Hybrid Systems ADHS.

[11] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada. Control barrier functions: Theory and applications. In 2019 18th European Control Conference (ECC), pages 3420–3431, 2019. doi:10.23919/ECC.2019.8796030

[12] R. Grandia, F. Jenelten, S. Yang, F. Farshidian, and M. Hutter. Perceptive locomotion through nonlinear model-predictive control. IEEE Transactions on Robotics, 39(5):3402–3421, 2023.

[13] Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning. In P. Agrawal, O. Kroemer, and W. Burgard, editors, Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 1975–1991. PMLR, 06–09 Nov 2025. URL https://proceedings.mlr.press/v270/zhuang25a.html

[14] X. Cheng, K. Shi, A. Agarwal, and D. Pathak. Extreme parkour with legged robots. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 11443–11450, 2024. doi:10.1109/ICRA57147.2024.10610200

[16] A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. In Conference on Robot Learning, pages 403–415. PMLR, 2023.

[17] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5(47), 2020.

[18] H. Duan, A. Malik, J. Dao, A. Saxena, K. Green, J. Siekmann, A. Fern, and J. Hurst. Sim-to-real learning of footstep-constrained bipedal dynamic walking. In 2022 International Conference on Robotics and Automation (ICRA), pages 10428–10434. IEEE, 2022.

[19] J. Sun, G. Han, P. Sun, W. Zhao, J. Cao, J. Wang, Y. Guo, and Q. Zhang. DPL: Depth-only perceptive humanoid locomotion via realistic depth synthesis and cross-attention terrain reconstruction, 2025. URL https://arxiv.org/abs/2510.07152

[20] W. Sun, B. Cao, L. Chen, Y. Su, Y. Liu, Z. Xie, and H. Liu. Learning perceptive humanoid locomotion over challenging terrain. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6571–6578, 2025. doi:10.1109/IROS60139.2025.11247685

[21] S. Zhu, Z. Zhuang, M. Zhao, K.-Y. Lee, and H. Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids, 2026. URL https://arxiv.org/abs/2601.07718

[22] Z. Artstein. Stabilization with relaxed controls. Nonlinear Analysis: Theory, Methods & Applications, 7(11):1163–1173, 1983.

[23] Z. Olkin, W. D. Compton, R. M. Bena, and A. D. Ames. Chasing autonomy: Dynamic retargeting and control guided RL for performant and controllable humanoid running, 2026. URL https://arxiv.org/abs/2603.25902

[24] N. Janwani, V. Madabushi, and M. Tucker. Navigait: Navigating dynamically feasible gait libraries using deep reinforcement learning, 2026. URL https://arxiv.org/abs/2510.11542

[25] X. Xiong and A. Ames. 3-D underactuated bipedal walking via H-LIP based gait synthesis and stepping stabilization. IEEE Transactions on Robotics, 38(4):2405–2425, 2022. doi:10.1109/TRO.2022.3150219

[26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[27] K. J. Åström and R. Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, 2021.

[28] A. D. Ames, K. Galloway, K. Sreenath, and J. W. Grizzle. Rapidly exponentially stabilizing control Lyapunov functions and hybrid zero dynamics. IEEE Transactions on Automatic Control, 59(4):876–891, 2014.

[29] E. D. Sontag and Y. Wang. On characterizations of the input-to-state stability property. Systems & Control Letters, 24(5):351–359, 1995.

[30] K. Zakka, Q. Liao, B. Yi, L. L. Lay, K. Sreenath, and P. Abbeel. mjlab: A lightweight framework for GPU-accelerated robot learning, 2026. URL https://arxiv.org/abs/2601.22074

[31] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012. doi:10.1109/IROS.2012.6386109

[32] C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. RSL-RL: A learning library for robotics research, 2025. URL https://arxiv.org/abs/2509.10771

9 Appendix. 9.1 Artificial Intelligence Acknowledgement. The authors acknowledge the use of artificial intelligence (AI) technologies in the preparation of this paper. Specifically, large language models w...
