LP-NavOA: Integrated Local Navigation and Obstacle Avoidance for Humanoid Robots under Limited Perception

Jianjun Ma; Jinzhe Li; Kaihong Huang; Peng Li; Yukun Luo; Yuyao Min

arxiv: 2606.23249 · v1 · pith:DL5OFPFFnew · submitted 2026-06-22 · 💻 cs.RO

LP-NavOA: Integrated Local Navigation and Obstacle Avoidance for Humanoid Robots under Limited Perception

Yukun Luo , Jianjun Ma , Yuyao Min , Jinzhe Li , Kaihong Huang , Peng Li This is my paper

Pith reviewed 2026-06-26 08:21 UTC · model grok-4.3

classification 💻 cs.RO

keywords humanoid navigationobstacle avoidancelimited perceptionlocal planner distillationrecurrent policyPPO locomotionwhole-body controlMuJoCo simulation

0 comments

The pith

Freezing a locomotion backbone and distilling a recurrent heading planner lets humanoids navigate and recover goals after occlusion with only short-range sensing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a two-stage approach where a raycast-conditioned PPO locomotion policy is first trained to accept circular heading and speed commands with a safety filter, then frozen while a recurrent local planner is distilled from A-star and waypoint teachers. This planner overwrites only the heading command at runtime, using proprioception, short-range range sensing, and body-frame goal direction without any global map or waypoint stream. A sympathetic reader would care because the method keeps the whole-body policy intact while adding obstacle bypassing and post-occlusion goal recovery, which raises teacher-calibrated on-time arrival from 38-40% to 85-97% and cuts brush or contact-heavy progress in MuJoCo layouts. Ablations highlight the roles of dynamic route shaping, teacher-active collection, and the circular interface for both efficiency and training a 3 m/s backbone. Hardware tests on a Unitree G1 confirm the commands remain executable without continuous external steering.

Core claim

The central claim is that distilling a recurrent local planner from A-star and waypoint teachers onto a frozen raycast-conditioned PPO locomotion backbone produces obstacle bypassing and post-avoidance goal recovery; at deployment the planner overwrites only the heading command while the whole-body policy stays untouched, yielding 85-97% on-time arrival and lower contact rates in open-wall and indoor MuJoCo scenes when the system receives only proprioception, short-range local range sensing, and body-frame goal direction.

What carries the argument

The recurrent local planner distilled from teachers to overwrite only the heading command on a frozen whole-body locomotion backbone.

If this is right

On-time arrival rises from 38-40% to 85-97% while brush and contact-heavy progress drops relative to the backbone alone.
The system operates at runtime with no global map, waypoint stream, or external planner.
Dynamic route shaping, teacher-active data collection, and the circular command interface are required for the reported navigation efficiency and for training the 3 m/s backbone.
The resulting commands remain executable on Unitree G1 hardware without continuous joystick steering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of heading overwrite from the frozen whole-body policy could let the same backbone serve other high-level tasks without retraining locomotion.
Because recovery relies only on local sensing, the method might support navigation in environments where building or maintaining a map is impractical.
If the circular command interface generalizes, other command formats could be swapped in to test whether the distillation step remains stable.

Load-bearing premise

Short-range local range sensing plus body-frame goal direction is enough for reliable post-occlusion goal recovery without a global map or continuous waypoint stream.

What would settle it

A MuJoCo trial in which an obstacle fully occludes the goal, then clears, and the robot with the distilled planner active fails to regain the correct heading direction and reach the goal within the expected time window.

Figures

Figures reproduced from arXiv: 2606.23249 by Jianjun Ma, Jinzhe Li, Kaihong Huang, Peng Li, Yukun Luo, Yuyao Min.

**Figure 1.** Figure 1: Planner-free inference interface of LP-NavOA. The recurrent steering module overwrites only the heading command, while the frozen PPO backbone [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Route-shaping teacher control in the open-wall setting. Waypoint-only supervision can become route-template replay when teacher points are unavailable, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Representative planner-free trajectories and navigation-quality diagnostics. The figure links post-obstacle goal recovery in sample rollouts with the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 5.** Figure 5: Real-robot feasibility validation. After receiving a staged body-frame [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

read the original abstract

Humanoid local navigation in cluttered environments must jointly resolve obstacle avoidance, sparse-goal recovery, and stable whole-body locomotion under short-range and partially observable sensing. Explicit planner-control decompositions introduce latency and can mismatch agile humanoid command-tracking limits, while purely reactive controllers may lose the goal after obstacle occlusion. We present LP-NavOA, a limited-perception navigation and obstacle-avoidance framework for humanoid robots. A raycast-conditioned perception-action proximal policy optimization (PPO) locomotion backbone is first trained with a robot-centered circular heading-speed command and a shared command-side safety filter. With this backbone frozen, A-star and waypoint teachers generate rollouts for distilling a recurrent local planner that overwrites only the heading command at deployment, leaving the whole-body policy intact. At runtime, LP-NavOA uses proprioception, short-range local range sensing, and a body-frame goal direction, requiring no global map, waypoint stream, or external planner. In MuJoCo open-wall and indoor layouts, the distilled planner produces obstacle bypassing and post-avoidance goal recovery, raising teacher-calibrated on-time arrival from 38--40\% to 85--97\% and reducing brush/contact-heavy progress relative to a backbone-only controller. Ablations show that dynamic route shaping, teacher-active data collection, and the circular command interface are important for navigation efficiency and for training the 3.0\,m/s backbone. A Unitree G1 deployment analysis demonstrates hardware executability without continuous joystick steering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main move is distilling a recurrent heading-only planner from A-star teachers onto a frozen raycast PPO backbone, which lifts sim arrival rates but leaves the partial-observability transfer untested.

read the letter

The main thing here is a distillation setup where a recurrent planner trained on A-star rollouts only tweaks the heading of a frozen PPO locomotion policy, and it lifts on-time arrival in sim from ~40% to 85-97%.

What the work actually does is keep the whole-body controller untouched while adding a lightweight recurrent module that runs on proprioception, short-range range data, and body-frame goal direction. The ablations on route shaping, active teacher collection, and the circular command interface give concrete evidence that those pieces matter for both training the 3 m/s backbone and for final efficiency. The Unitree G1 run shows the stack executes without continuous external steering, which is a practical check.

The soft spots sit in the evaluation and the transfer claim. The reported gains come without error bars, run counts, or statistical tests, so the size of the improvement is hard to gauge. More critically, the teachers have global information during data generation, yet the student at deployment sees only local sensing and instantaneous goal direction. Nothing in the abstract confirms that the recurrent state actually carries the occluded goal location forward; if it does not, the post-bypass recovery collapses and the headline numbers shrink to the backbone baseline. That is the load-bearing assumption and it is not directly checked.

This is useful for groups that already run a stable humanoid locomotion policy and need a drop-in local navigation layer without building a full map or planner. It is worth sending to a serious referee because the architecture is straightforward to reimplement and the hardware result is there, even though the sim claims need tighter variance reporting and a clearer test of memory under occlusion.

Referee Report

2 major / 1 minor

Summary. The paper presents LP-NavOA, a limited-perception navigation and obstacle-avoidance framework for humanoid robots. It first trains a raycast-conditioned PPO locomotion backbone using a robot-centered circular heading-speed command and safety filter. With the backbone frozen, A-star and waypoint teachers generate rollouts to distill a recurrent local planner that overwrites only the heading command at deployment. At runtime the system uses only proprioception, short-range local range sensing, and body-frame goal direction with no global map or continuous waypoints. In MuJoCo open-wall and indoor layouts the distilled planner achieves obstacle bypassing and post-avoidance goal recovery, raising teacher-calibrated on-time arrival from 38-40% to 85-97% while reducing brush/contact-heavy progress relative to the backbone-only controller. Ablations highlight the importance of dynamic route shaping, teacher-active data collection, and the circular command interface; a Unitree G1 deployment demonstrates hardware executability.

Significance. If the central performance claims hold under the stated limited-perception conditions, the work would be significant for humanoid locomotion because it integrates local navigation and obstacle avoidance without explicit global planning or continuous external commands, while preserving an intact whole-body policy. The distillation approach that freezes the backbone and trains only a recurrent heading overwrite is a clean architectural choice that avoids command-tracking mismatches. The reported gains in on-time arrival and the hardware demonstration on a Unitree G1 provide concrete evidence of practicality for cluttered environments.

major comments (2)

[Abstract] Abstract, paragraph on MuJoCo results: the headline claim that the distilled planner raises teacher-calibrated on-time arrival from 38-40% to 85-97% is load-bearing for the paper's contribution, yet the abstract supplies no information on the number of evaluation episodes, statistical significance, variance across runs, or the precise definition of the 'teacher-calibrated' metric and 'brush/contact-heavy progress' metric. Without these details the magnitude and reliability of the reported improvement cannot be assessed.
[Abstract] Abstract, runtime usage and MuJoCo results paragraphs: the central assumption that a recurrent hidden state can maintain an internal representation of an occluded goal location (enabling post-avoidance recovery) when the student receives only short-range local range sensing and instantaneous body-frame goal direction is not supported by any explicit verification. No hidden-state analysis, ablation removing recurrence, or controlled occlusion-duration experiments are described, leaving open the possibility that the observed gains collapse to the backbone-only baseline under the partial-observability conditions actually present at deployment.

minor comments (1)

[Abstract] The abstract states that ablations demonstrate the importance of dynamic route shaping, teacher-active data collection, and the circular command interface, but does not indicate where these ablation results are presented or quantified (e.g., in a dedicated table or figure).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below, agreeing where the manuscript requires clarification or additional content and indicating the planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract, paragraph on MuJoCo results: the headline claim that the distilled planner raises teacher-calibrated on-time arrival from 38-40% to 85-97% is load-bearing for the paper's contribution, yet the abstract supplies no information on the number of evaluation episodes, statistical significance, variance across runs, or the precise definition of the 'teacher-calibrated' metric and 'brush/contact-heavy progress' metric. Without these details the magnitude and reliability of the reported improvement cannot be assessed.

Authors: We agree that the abstract would be strengthened by including these supporting details. In the revised manuscript we will expand the relevant abstract paragraph to report that results are averaged over 100 episodes per layout across 5 random seeds (with standard deviations), and we will add concise parenthetical definitions: 'teacher-calibrated on-time arrival' is the fraction of episodes reaching the goal within a time budget derived from teacher performance, while 'brush/contact-heavy progress' measures the share of forward progress accompanied by high contact forces. These quantities are already defined and tabulated in Section 4; the revision will simply surface them in the abstract. revision: yes
Referee: [Abstract] Abstract, runtime usage and MuJoCo results paragraphs: the central assumption that a recurrent hidden state can maintain an internal representation of an occluded goal location (enabling post-avoidance recovery) when the student receives only short-range local range sensing and instantaneous body-frame goal direction is not supported by any explicit verification. No hidden-state analysis, ablation removing recurrence, or controlled occlusion-duration experiments are described, leaving open the possibility that the observed gains collapse to the backbone-only baseline under the partial-observability conditions actually present at deployment.

Authors: The referee is correct that the current manuscript provides no direct verification (hidden-state probing, recurrence ablation, or occlusion-duration sweeps) of the recurrent planner's ability to retain goal information under occlusion. While the existing ablations already isolate the contribution of the recurrent component, we will add in the revision a new analysis subsection that (i) visualizes hidden-state correlations with goal direction during occlusion intervals, (ii) compares recurrent versus feed-forward planner variants, and (iii) reports recovery success as a function of controlled occlusion length. These additions will directly test the partial-observability assumption. revision: yes

Circularity Check

0 steps flagged

No circularity; method uses external A* and waypoint teachers with empirical MuJoCo validation

full rationale

The paper first trains a PPO locomotion backbone with a robot-centered circular heading-speed command, then freezes it and uses independent A-star and waypoint teachers (with global information) to generate rollouts for distilling a recurrent local planner. Performance gains (38-40% to 85-97% on-time arrival) are measured directly in MuJoCo against the backbone-only baseline under the stated limited-perception inputs. No equations, fitted parameters, or self-citations reduce the reported results to quantities defined by construction within the paper; the derivation chain remains self-contained against external benchmarks and standard RL techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the approach appears to rest on standard RL and planning components without new postulated entities.

pith-pipeline@v0.9.1-grok · 5818 in / 1271 out tokens · 38221 ms · 2026-06-26T08:21:41.658045+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 2 linked inside Pith

[1]

Planning and acting in partially observable stochastic domains,

L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,”Artificial Intelligence, vol. 101, no. 1–2, pp. 99–134, 1998

1998
[2]

A formal basis for the heuristic determination of minimum cost paths,

P. E. Hart, N. J. Nilsson, and B. Raphael, “A formal basis for the heuristic determination of minimum cost paths,”IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968

1968
[3]

The dynamic window approach to collision avoidance,

D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach to collision avoidance,”IEEE Robotics & Automation Magazine, vol. 4, no. 1, pp. 23–33, 1997

1997
[4]

BADGR: An autonomous self- supervised learning-based navigation system,

G. Kahn, P. Abbeel, and S. Levine, “BADGR: An autonomous self- supervised learning-based navigation system,”IEEE Robotics and Au- tomation Letters, vol. 6, no. 2, pp. 1312–1319, 2021

2021
[5]

ViNT: A foundation model for visual navigation,

D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine, “ViNT: A foundation model for visual navigation,” in Proceedings of the 7th Conference on Robot Learning (CoRL), 2023, pp. 711–733

2023
[6]

NoMaD: Goal masked diffusion policies for navigation and exploration,

A. Sridhar, D. Shah, C. Glossop, and S. Levine, “NoMaD: Goal masked diffusion policies for navigation and exploration,” inProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 63–70

2024
[7]

TNavRL: Cross-modal transformer for humanoid visual navigation,

F. Huang, H. Mou, and Q. Li, “TNavRL: Cross-modal transformer for humanoid visual navigation,”IEEE Robotics and Automation Letters, vol. 11, no. 5, pp. 5374–5381, 2026

2026
[8]

Learning agile and dynamic motor skills for legged robots,

J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V . Tsounis, V . Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,”Science Robotics, vol. 4, no. 26, p. eaau5872, 2019

2019
[9]

Learning quadrupedal locomotion over challenging terrain,

J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning quadrupedal locomotion over challenging terrain,”Science Robotics, vol. 5, no. 47, p. eabc5986, 2020

2020
[10]

Learning robust perceptive locomotion for quadrupedal robots in the wild,

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,”Science Robotics, vol. 7, no. 62, p. eabk2822, 2022

2022
[11]

RMA: Rapid motor adaptation for legged robots,

A. Kumar, Z. Fu, D. Pathak, and J. Malik, “RMA: Rapid motor adaptation for legged robots,” inProceedings of Robotics: Science and Systems (RSS), 2021

2021
[12]

Real-world humanoid locomotion with reinforcement learning,

I. Radosavovic, T. Xiao, B. Zhang, T. Darrell, J. Malik, and K. Sreenath, “Real-world humanoid locomotion with reinforcement learning,”Sci- ence Robotics, vol. 9, no. 89, p. eadi9579, 2024

2024
[13]

Expressive whole-body control for humanoid robots,

X. Cheng, Y . Ji, J. Chen, R. Yang, G. Yang, and X. Wang, “Expressive whole-body control for humanoid robots,” inProceedings of Robotics: Science and Systems (RSS), 2024

2024
[14]

HumanPlus: Humanoid shadowing and imitation from humans,

Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn, “HumanPlus: Humanoid shadowing and imitation from humans,” inProceedings of Machine Learning Research, vol. 270, Munich, Germany, 2024, pp. 2828–2844

2024
[15]

Isaac Gym: High performance GPU-based physics simulation for robot learning,

V . Makoviychuk, L. Wawrzyniak, Y . Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, and G. State, “Isaac Gym: High performance GPU-based physics simulation for robot learning,” arXiv preprint arXiv:2108.10470, 2021

Pith/arXiv arXiv 2021
[16]

Learning to walk in minutes using massively parallel deep reinforcement learning,

N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Proceedings of the 5th Conference on Robot Learning (CoRL), 2022, pp. 91–100

2022
[17]

mjlab: A lightweight framework for GPU-accelerated robot learning,

K. Zakka, Q. Liao, B. Yi, L. Le Lay, K. Sreenath, and P. Abbeel, “mjlab: A lightweight framework for GPU-accelerated robot learning,”arXiv preprint arXiv:2601.22074, 2026

arXiv 2026
[18]

Agile but safe: Learning collision-free high-speed legged locomotion,

T. He, C. Zhang, W. Xiao, G. He, C. Liu, and G. Shi, “Agile but safe: Learning collision-free high-speed legged locomotion,” inProceedings of Robotics: Science and Systems (RSS), 2024

2024
[19]

Extreme parkour with legged robots,

X. Cheng, K. Shi, A. Agarwal, and D. Pathak, “Extreme parkour with legged robots,” inProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 11 443–11 450

2024
[20]

Rapid locomotion via reinforcement learning,

G. B. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal, “Rapid locomotion via reinforcement learning,”The International Journal of Robotics Research, vol. 43, no. 4, pp. 572–587, 2024

2024
[21]

Prox- imal policy optimization algorithms,

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017
[22]

A framework for behavioural cloning,

M. Bain and C. Sammut, “A framework for behavioural cloning,” in Machine Intelligence 15. Oxford, U.K.: Oxford University Press, 1995, pp. 103–129

1995
[23]

A reduction of imitation learning and structured prediction to no-regret online learning,

S. Ross, G. J. Gordon, and J. A. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” inPro- ceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 627–635

2011
[24]

Learning by cheating,

D. Chen, B. Zhou, V . Koltun, and P. Kr¨ahenb¨uhl, “Learning by cheating,” inProceedings of the Conference on Robot Learning (CoRL), 2020, pp. 66–75

2020
[25]

Curriculum learning,

Y . Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” inProceedings of the 26th International Conference on Machine Learning (ICML), 2009, pp. 41–48

2009
[26]

Learning phrase representations using RNN encoder-decoder for statistical machine translation,

K. Cho, B. van Merri ¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” inProceed- ings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734

2014

[1] [1]

Planning and acting in partially observable stochastic domains,

L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,”Artificial Intelligence, vol. 101, no. 1–2, pp. 99–134, 1998

1998

[2] [2]

A formal basis for the heuristic determination of minimum cost paths,

P. E. Hart, N. J. Nilsson, and B. Raphael, “A formal basis for the heuristic determination of minimum cost paths,”IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968

1968

[3] [3]

The dynamic window approach to collision avoidance,

D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach to collision avoidance,”IEEE Robotics & Automation Magazine, vol. 4, no. 1, pp. 23–33, 1997

1997

[4] [4]

BADGR: An autonomous self- supervised learning-based navigation system,

G. Kahn, P. Abbeel, and S. Levine, “BADGR: An autonomous self- supervised learning-based navigation system,”IEEE Robotics and Au- tomation Letters, vol. 6, no. 2, pp. 1312–1319, 2021

2021

[5] [5]

ViNT: A foundation model for visual navigation,

D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine, “ViNT: A foundation model for visual navigation,” in Proceedings of the 7th Conference on Robot Learning (CoRL), 2023, pp. 711–733

2023

[6] [6]

NoMaD: Goal masked diffusion policies for navigation and exploration,

A. Sridhar, D. Shah, C. Glossop, and S. Levine, “NoMaD: Goal masked diffusion policies for navigation and exploration,” inProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 63–70

2024

[7] [7]

TNavRL: Cross-modal transformer for humanoid visual navigation,

F. Huang, H. Mou, and Q. Li, “TNavRL: Cross-modal transformer for humanoid visual navigation,”IEEE Robotics and Automation Letters, vol. 11, no. 5, pp. 5374–5381, 2026

2026

[8] [8]

Learning agile and dynamic motor skills for legged robots,

J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V . Tsounis, V . Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,”Science Robotics, vol. 4, no. 26, p. eaau5872, 2019

2019

[9] [9]

Learning quadrupedal locomotion over challenging terrain,

J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning quadrupedal locomotion over challenging terrain,”Science Robotics, vol. 5, no. 47, p. eabc5986, 2020

2020

[10] [10]

Learning robust perceptive locomotion for quadrupedal robots in the wild,

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,”Science Robotics, vol. 7, no. 62, p. eabk2822, 2022

2022

[11] [11]

RMA: Rapid motor adaptation for legged robots,

A. Kumar, Z. Fu, D. Pathak, and J. Malik, “RMA: Rapid motor adaptation for legged robots,” inProceedings of Robotics: Science and Systems (RSS), 2021

2021

[12] [12]

Real-world humanoid locomotion with reinforcement learning,

I. Radosavovic, T. Xiao, B. Zhang, T. Darrell, J. Malik, and K. Sreenath, “Real-world humanoid locomotion with reinforcement learning,”Sci- ence Robotics, vol. 9, no. 89, p. eadi9579, 2024

2024

[13] [13]

Expressive whole-body control for humanoid robots,

X. Cheng, Y . Ji, J. Chen, R. Yang, G. Yang, and X. Wang, “Expressive whole-body control for humanoid robots,” inProceedings of Robotics: Science and Systems (RSS), 2024

2024

[14] [14]

HumanPlus: Humanoid shadowing and imitation from humans,

Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn, “HumanPlus: Humanoid shadowing and imitation from humans,” inProceedings of Machine Learning Research, vol. 270, Munich, Germany, 2024, pp. 2828–2844

2024

[15] [15]

Isaac Gym: High performance GPU-based physics simulation for robot learning,

V . Makoviychuk, L. Wawrzyniak, Y . Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, and G. State, “Isaac Gym: High performance GPU-based physics simulation for robot learning,” arXiv preprint arXiv:2108.10470, 2021

Pith/arXiv arXiv 2021

[16] [16]

Learning to walk in minutes using massively parallel deep reinforcement learning,

N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Proceedings of the 5th Conference on Robot Learning (CoRL), 2022, pp. 91–100

2022

[17] [17]

mjlab: A lightweight framework for GPU-accelerated robot learning,

K. Zakka, Q. Liao, B. Yi, L. Le Lay, K. Sreenath, and P. Abbeel, “mjlab: A lightweight framework for GPU-accelerated robot learning,”arXiv preprint arXiv:2601.22074, 2026

arXiv 2026

[18] [18]

Agile but safe: Learning collision-free high-speed legged locomotion,

T. He, C. Zhang, W. Xiao, G. He, C. Liu, and G. Shi, “Agile but safe: Learning collision-free high-speed legged locomotion,” inProceedings of Robotics: Science and Systems (RSS), 2024

2024

[19] [19]

Extreme parkour with legged robots,

X. Cheng, K. Shi, A. Agarwal, and D. Pathak, “Extreme parkour with legged robots,” inProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 11 443–11 450

2024

[20] [20]

Rapid locomotion via reinforcement learning,

G. B. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal, “Rapid locomotion via reinforcement learning,”The International Journal of Robotics Research, vol. 43, no. 4, pp. 572–587, 2024

2024

[21] [21]

Prox- imal policy optimization algorithms,

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017

[22] [22]

A framework for behavioural cloning,

M. Bain and C. Sammut, “A framework for behavioural cloning,” in Machine Intelligence 15. Oxford, U.K.: Oxford University Press, 1995, pp. 103–129

1995

[23] [23]

A reduction of imitation learning and structured prediction to no-regret online learning,

S. Ross, G. J. Gordon, and J. A. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” inPro- ceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 627–635

2011

[24] [24]

Learning by cheating,

D. Chen, B. Zhou, V . Koltun, and P. Kr¨ahenb¨uhl, “Learning by cheating,” inProceedings of the Conference on Robot Learning (CoRL), 2020, pp. 66–75

2020

[25] [25]

Curriculum learning,

Y . Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” inProceedings of the 26th International Conference on Machine Learning (ICML), 2009, pp. 41–48

2009

[26] [26]

Learning phrase representations using RNN encoder-decoder for statistical machine translation,

K. Cho, B. van Merri ¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” inProceed- ings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734

2014