pith. machine review for the scientific record.

arxiv: 2604.15289 · v1 · submitted 2026-04-16 · 💻 cs.RO

Recognition: unknown

Abstract Sim2Real through Approximate Information States


Pith reviewed 2026-05-10 10:37 UTC · model grok-4.3

classification 💻 cs.RO
keywords sim2real · reinforcement learning · state abstraction · robotics · policy transfer · simulator correction · abstract simulation

The pith

An abstract simulator can be grounded to the real world if its dynamics account for state history and are corrected with real task data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper formalizes the abstract sim2real problem, in which an abstract simulator models a robotics task only at a coarse level. It shows that grounding succeeds when the abstract dynamics are made to depend on the full history of states rather than the current state alone. The authors then present a method that uses real-world task data to adjust those history-dependent dynamics. Experiments demonstrate that policies trained in the corrected simulator transfer successfully both in simulation-to-simulation and simulation-to-real settings.

Core claim

Using the language of state abstraction from reinforcement learning, the paper establishes that an abstract simulator matches the target task when its grounded dynamics incorporate the history of states. A correction procedure is introduced that updates the abstract dynamics from real-world task data, after which reinforcement learning in the corrected simulator produces policies that transfer to the real world.
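
As a reading aid, the grounding condition can be written explicitly. The notation below (an abstraction map and an abstract-state history) is ours, reconstructed from the review; the paper's own symbols may differ.

```latex
% Illustrative notation (ours, not the paper's):
%   \phi : S \to \bar{S}   state abstraction applied to the real state s_t
%   \bar{s}_t = \phi(s_t), \qquad h_t = (\bar{s}_0, a_0, \ldots, \bar{s}_t)
\begin{align*}
  &\text{Markov grounding fails in general:}
    && \Pr\big(\bar{s}_{t+1} \mid h_t, a_t\big)
       \text{ is not a function of } (\bar{s}_t, a_t) \text{ alone;}\\
  &\text{history-dependent grounding matches by construction:}
    && \bar{P}\big(\bar{s}_{t+1} \mid h_t, a_t\big)
       := \Pr\big(\phi(s_{t+1}) = \bar{s}_{t+1} \mid h_t, a_t\big).
\end{align*}
```

Intuitively, the abstraction throws away state detail that still influences future abstract states, and conditioning on history is what recovers it.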

What carries the argument

Grounded abstract dynamics that depend on the full history of states, derived from state-abstraction formalism in RL, to compensate for details omitted by the coarse simulator.

If this is right

  • Policies trained with reinforcement learning in the corrected abstract simulator transfer to the real world.
  • The same correction approach improves transfer in sim2sim settings as well as sim2real settings.
  • Accounting for state history in the abstract dynamics is necessary to bridge the gap created by simulator abstraction.
  • Real-world data can be used directly to adjust simulator dynamics rather than to train policies from scratch (see the toy sketch after this list).
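
To make the last bullet concrete, here is a self-contained toy (ours, not the paper's code): a "real" system whose hidden momentum is invisible to a memoryless abstract point-mass model, where a dynamics correction fitted on real trajectories succeeds only once it conditions on one step of history.

```python
# Toy illustration (ours, not the paper's code): the "real" system has hidden
# momentum; the abstract state is position only. A correction fitted from real
# trajectories fails when memoryless and succeeds with one step of state
# history, which is enough to recover the hidden velocity here.
import numpy as np

rng = np.random.default_rng(0)

def collect(n_traj=50, T=30):
    """Roll out the real system; record (position, action, next position)."""
    trajs = []
    for _ in range(n_traj):
        x, v, traj = 0.0, 0.0, []
        for _ in range(T):
            a = rng.uniform(-1.0, 1.0)
            v = 0.9 * v + 0.1 * a          # hidden velocity the simulator omits
            traj.append((x, a, x + v))
            x += v
        trajs.append(traj)
    return trajs

# Build regression features for two candidate corrections.
X_mem, X_hist, y = [], [], []
for traj in collect():
    for t in range(1, len(traj)):
        x, a, nx = traj[t]
        x_prev = traj[t - 1][0]
        X_mem.append([x, a, 1.0])             # memoryless: current state only
        X_hist.append([x, x_prev, a, 1.0])    # history: one extra past state
        y.append(nx)
X_mem, X_hist, y = np.array(X_mem), np.array(X_hist), np.array(y)

w_mem, *_ = np.linalg.lstsq(X_mem, y, rcond=None)
w_hist, *_ = np.linalg.lstsq(X_hist, y, rcond=None)
print(f"memoryless fit error: {np.abs(X_mem @ w_mem - y).mean():.4f}")    # stays large
print(f"history fit error:    {np.abs(X_hist @ w_hist - y).mean():.4f}")  # ~0
```

In this toy, one past position suffices because two positions determine the hidden velocity exactly; in general the required history depends on what the abstraction discards.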

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This framing suggests that many existing coarse simulators could be made usable for policy transfer by adding a lightweight history-dependent correction layer (sketched after this list).
  • The amount of real data needed may be smaller than for full system identification because only the abstract mismatch must be learned.
  • The approach could extend to other sequential decision problems where simulators are necessarily incomplete.
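
A sketch of the "lightweight correction layer" reading in the first bullet above. Everything here is our construction under that reading, not an interface from the paper: a linear residual model over a k-step window of (state, action) pairs, fitted to real transitions and added on top of a memoryless coarse simulator.

```python
# Our construction, not the paper's API: wrap a memoryless abstract simulator
# and learn a residual next-state correction conditioned on a k-step window
# of past (state, action) pairs.
import numpy as np

class HistoryCorrectedSim:
    def __init__(self, base_step, k=2):
        self.base_step = base_step  # coarse, memoryless dynamics: (s, a) -> s'
        self.k = k                  # window length of (state, action) history
        self.w = None               # residual weights, set by fit()

    def _feats(self, window):
        # window: the k most recent (state, action) pairs, oldest first
        assert len(window) == self.k
        return np.array([v for s, a in window for v in (s, a)] + [1.0])

    def fit(self, transitions):
        """transitions: list of (window, real_next_state) pairs from real data."""
        X = np.stack([self._feats(w) for w, _ in transitions])
        # Target the residual: what the coarse model gets wrong.
        r = np.array([ns - self.base_step(*w[-1]) for w, ns in transitions])
        self.w, *_ = np.linalg.lstsq(X, r, rcond=None)

    def step(self, window):
        """Corrected prediction = coarse model + learned history residual."""
        s, a = window[-1]
        return self.base_step(s, a) + self._feats(window) @ self.w
```

On a toy system like the momentum example sketched earlier, fit with k = 2 recovers exactly the term the memoryless model omits, because the residual is linear in the windowed features; real tasks would need a richer residual model.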

Load-bearing premise

Real-world task data is sufficient to correct the abstract dynamics accurately enough that a policy trained in the corrected simulator will transfer to the target task.

What would settle it

A policy trained in the history-corrected abstract simulator fails to transfer successfully to the real world even after the dynamics have been updated with real task data.

Figures

Figures reproduced from arXiv: 2604.15289 by Josiah P. Hanna, Yuhao Li, Yunfu Deng.

Figure 1. Identical velocity commands produce perfect tracking in point-mass …
Figure 2. Illustrations of trajectories learned by different approaches when transferring from PointMaze (abstract point-mass) to AntMaze (quadruped locomotion).
Figure 3. Success rates on U-Maze (left) and Long Maze (right) navigation tasks.
Figure 4. Abstraction hierarchy used for humanoid locomotion experiments …
Figure 5. Humanoid locomotion results across three abstraction levels (10 seeds; …
Figure 6. Dataset efficiency analysis. Top: position coverage heatmap show …
read the original abstract

In recent years, reinforcement learning (RL) has shown remarkable success in robotics when a fast and accurate simulator is available for a given task. When using RL and simulation, more simulator realism is generally beneficial but becomes harder to obtain as robots are deployed in increasingly complex and widescale domains. In such settings, simulators will likely fail to model all relevant details of a given target task and this observation motivates the study of sim2real with simulators that leave out key task details. In this paper, we formalize and study the abstract sim2real problem: given an abstract simulator that models a target task at a coarse level of abstraction, how can we train a policy with RL in the abstract simulator and successfully transfer it to the real-world? Our first contribution is to formalize this problem using the language of state abstraction from the RL literature. This framing shows that an abstract simulator can be grounded to match the target task if the grounded abstract dynamics take the history of states into account. Based on the formalism, we then introduce a method that uses real-world task data to correct the dynamics of the abstract simulator. We then show that this method enables successful policy transfer both in sim2sim and sim2real evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes the abstract sim2real problem in robotics RL: given a coarse abstract simulator that omits key task details, how to train an RL policy in it that transfers to the real world. Using state abstraction, it shows that grounding requires history-dependent abstract dynamics. It then proposes a method to correct those dynamics from real-world task trajectories and reports successful policy transfer in both sim2sim and sim2real experiments.

Significance. If the correction procedure reliably produces history-dependent abstract dynamics that generalize beyond the collected trajectories, the work would meaningfully lower the barrier to RL in complex robotics domains by permitting the use of fast but incomplete simulators. The state-abstraction framing supplies a clean conceptual tool for analyzing simulator-reality mismatch.

major comments (2)
  1. [§3] Formalism: The claim that history-dependent grounding suffices to match the target task is stated without a bound on the residual approximation error of the information state after correction from finite real trajectories. Absent such a bound or a concrete counter-example analysis, it is unclear whether the formalism guarantees transfer for policies that visit states outside the support of the collected data. (The generic shape such a bound could take is sketched after the minor comments.)
  2. [§4, §5] Correction method and experiments: The procedure that updates the abstract dynamics from real-world task data is presented as sufficient for transfer, yet the manuscript provides no ablation on data volume, collection bias, or coverage of policy-induced state distributions. Such an ablation would directly test the weakest assumption: that limited real trajectories yield an approximate information state close enough for RL policies to transfer without post-hoc fitting.
minor comments (2)
  1. [Abstract] The abstract is dense and would benefit from a single illustrative diagram of the history-dependent grounding step.
  2. [§2] Notation for the approximate information state is introduced without an explicit comparison table to prior state-abstraction definitions (e.g., those in Li et al. or Abel et al.); adding such a table would improve readability.
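
For context on major comment 1, the requested guarantee would most naturally take the shape of the classical simulation lemma from the RL literature. The following is that generic shape, stated in the notation sketched under "Core claim" above; it is not a result from the paper.

```latex
% Generic simulation-lemma bound (a standard RL result, not from the paper):
% if the corrected history-dependent kernel \hat{\bar P} is uniformly within
% \epsilon of the true grounded kernel \bar P in total variation, then for
% rewards in [0, R_{\max}] and discount \gamma < 1,
\[
  \sup_{h,a}\big\lVert \hat{\bar P}(\cdot \mid h,a) - \bar P(\cdot \mid h,a) \big\rVert_1 \le \epsilon
  \;\Longrightarrow\;
  \big|\, J_{\mathrm{real}}(\pi) - J_{\widehat{\mathrm{sim}}}(\pi) \,\big|
  \le \frac{\gamma\,\epsilon\,R_{\max}}{(1-\gamma)^2}
  \quad \text{for every policy } \pi.
\]
% The referee's objection is that finite real data controls \epsilon only on
% visited histories, not uniformly, which is exactly where this bound breaks.
```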

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. The feedback identifies important gaps in the theoretical analysis and empirical validation of our approach to abstract sim2real transfer. We respond to each major comment below and indicate the planned revisions.

read point-by-point responses
  1. Referee: [§3] Formalism: The claim that history-dependent grounding suffices to match the target task is stated without a bound on the residual approximation error of the information state after correction from finite real trajectories. Absent such a bound or a concrete counter-example analysis, it is unclear whether the formalism guarantees transfer for policies that visit states outside the support of the collected data.

    Authors: We agree that the current formalism shows sufficiency of history-dependent abstract dynamics for recovering the target information state in the infinite-data limit but does not supply a finite-sample bound on residual error or an explicit counter-example analysis for out-of-support states. This is a genuine limitation of the theoretical development. In the revision we will expand the discussion in §3 to explicitly state the infinite-data assumption, clarify that finite-trajectory correction produces only an approximation, and include a short paragraph on potential failure modes when policies visit states outside the collected data support. Deriving a general PAC-style bound lies beyond the scope of the present work. revision: partial

  2. Referee: [§4, §5] Correction method and experiments: The procedure that updates the abstract dynamics from real-world task data is presented as sufficient for transfer, yet the manuscript provides no ablation on data volume, collection bias, or coverage of policy-induced state distributions. Such an ablation would directly test the weakest assumption: that limited real trajectories yield an approximate information state close enough for RL policies to transfer without post-hoc fitting.

    Authors: The referee correctly notes that our experiments report successful transfer but omit systematic ablations on real-world data volume, collection bias, and coverage of the state distributions induced by the learned policies. These omissions leave the core practical assumption under-tested. We will revise §5 to add new ablation experiments that vary the number of real trajectories used for dynamics correction, report transfer performance as a function of data volume, and include quantitative analysis of state-distribution coverage between the collected trajectories and the final policy rollouts. revision: yes
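
A minimal version of the promised ablation, using the same toy momentum system as earlier in this pith plus a small amount of process noise so that finite data matters (our construction, not the authors' planned experiments): vary the number of real trajectories used for the correction and track held-out prediction error as a crude transfer proxy.

```python
# Toy data-volume ablation (ours, not the authors' planned experiments):
# fit the history-dependent correction from n real trajectories and measure
# error on held-out real trajectories; error falls toward the noise floor
# as n grows.
import numpy as np

rng = np.random.default_rng(1)

def rollout(T=30):
    # Same toy real system as before, plus small process noise.
    x, v, traj = 0.0, 0.0, []
    for _ in range(T):
        a = rng.uniform(-1.0, 1.0)
        v = 0.9 * v + 0.1 * a + 0.01 * rng.normal()
        traj.append((x, a, x + v))
        x += v
    return traj

def features(trajs):
    X, y = [], []
    for traj in trajs:
        for t in range(1, len(traj)):
            x, a, nx = traj[t]
            X.append([x, traj[t - 1][0], a, 1.0])  # one-step history features
            y.append(nx)
    return np.array(X), np.array(y)

X_test, y_test = features([rollout() for _ in range(100)])  # held-out "real" data

for n in (1, 2, 5, 20, 100):
    X, y = features([rollout() for _ in range(n)])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    err = np.abs(X_test @ w - y_test).mean()
    print(f"{n:4d} real trajectories -> held-out error {err:.6f}")
```

A real version of this ablation would replace held-out prediction error with transfer success rate, as the authors propose.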

Circularity Check

0 steps flagged

No circularity in the derivation chain of abstract sim2real formalization

full rationale

The paper's abstract describes formalizing the abstract sim2real problem using state abstraction from the RL literature. This framing leads to the observation that grounded abstract dynamics should account for state history. A method is introduced that uses real-world task data to correct the abstract simulator's dynamics, with evaluations showing successful policy transfer in sim2sim and sim2real settings. No equations are provided in the abstract, and no derivation chain reduces predictions or results to inputs by construction. There are no visible self-definitional elements, fitted parameters presented as predictions, or load-bearing self-citations. The argument is non-circular: it builds on external RL concepts and supports its claim through an independent method and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard RL state-abstraction concepts and the assumption that real data can correct abstract dynamics; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption State abstraction concepts from the RL literature can be used to model the mismatch between abstract simulator and target task.
    Invoked to formalize the problem in the abstract.

pith-pipeline@v0.9.0 · 5511 in / 1232 out tokens · 46948 ms · 2026-05-10T10:37:00.417742+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    Outracing champion gran turismo drivers with deep reinforcement learning

    P. R. Wurman, S. Barrett, K. Kawamoto, J. MacGlashan, K. Subramanian, T. J. Walsh, R. Capobianco, A. Devlic, F. Eckert, F. Fuchs et al., "Outracing champion gran turismo drivers with deep reinforcement learning," Nature, vol. 602, no. 7896, pp. 223–228, 2022.

  2. [2]

    Learning dexterous in-hand manipulation

    O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020.

  3. [3]

    DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames

    E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra, "DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames," arXiv preprint, 2019.

  4. [4]

    Learning agile and dynamic motor skills for legged robots

    J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, "Learning agile and dynamic motor skills for legged robots," Science Robotics, vol. 4, no. 26, p. eaau5872, 2019.

  5. [5]

    Sim-to-real transfer of robotic control with dynamics randomization

    X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Sim-to-real transfer of robotic control with dynamics randomization," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 3803–3810.

  6. [6]

    Comparative study of physics engines for robot simulation with mechanical interaction

    J. Yoon, B. Son, and D. Lee, "Comparative study of physics engines for robot simulation with mechanical interaction," Applied Sciences, vol. 13, no. 2, p. 680, 2023.

  7. [7]

    From abstraction to reality: DARPA's vision for robust sim-to-real autonomy

    E. Noorani, Z. Serlin, B. Price, and A. Velasquez, "From abstraction to reality: DARPA's vision for robust sim-to-real autonomy," AI Magazine, vol. 46, no. 2, p. e70015, 2025.

  8. [8]

    Rethinking sim2real: Lower fidelity simulation leads to higher sim2real transfer in navigation

    J. Truong, M. Rudolph, N. H. Yokoyama, S. Chernova, D. Batra, and A. Rai, "Rethinking sim2real: Lower fidelity simulation leads to higher sim2real transfer in navigation," in Conference on Robot Learning. PMLR, 2023, pp. 859–870.

  9. [9]

    Perspectives on Sim2Real Transfer for Robotics: A Summary of the R:SS 2020 Workshop

    S. Höfer, K. Bekris, A. Handa, J. C. Gamboa, F. Golemo, M. Mozifian, C. Atkeson, D. Fox, K. Goldberg, J. Leonard, C. K. Liu, J. Peters, S. Song, P. Welinder, and M. White, "Perspectives on Sim2Real Transfer for Robotics: A Summary of the R:SS 2020 Workshop," Dec. 2020, arXiv:2012.03806 [cs]. [Online]. Available: http://arxiv.org/abs/2012.03806

  10. [10]

    Multi-Robot Collaboration through Reinforcement Learning and Abstract Simulation

    A. Labiosa and J. P. Hanna, "Multi-Robot Collaboration through Reinforcement Learning and Abstract Simulation," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2025.

  11. [11]

    Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer

    A. Labiosa, Z. Wang, S. Agarwal, W. Cong, G. Hemkumar, A. N. Harish, B. Hong, J. Kelle, C. Li, Y. Li, Z. Shao, P. Stone, and J. P. Hanna, "Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025.

  12. [12]

    What went wrong? Closing the sim-to-real gap via differentiable causal discovery

    P. Huang, X. Zhang, Z. Cao, S. Liu, M. Xu, W. Ding, J. Francis, B. Chen, and D. Zhao, "What went wrong? Closing the sim-to-real gap via differentiable causal discovery," in Conference on Robot Learning. PMLR, 2023, pp. 734–760.

  13. [13]

    Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey

    W. Zhao, J. P. Queralta, and T. Westerlund, "Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey," in 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Dec. 2020, pp. 737–744.

  14. [14]

    People construct simplified mental representations to plan

    M. K. Ho, D. Abel, C. G. Correa, M. L. Littman, J. D. Cohen, and T. L. Griffiths, "People construct simplified mental representations to plan," Nature, vol. 606, no. 7912, pp. 129–136, Jun. 2022. [Online]. Available: https://www.nature.com/articles/s41586-022-04743-9

  15. [15]

    Multi-agent manipulation via locomotion using hierarchical sim2real

    O. Nachum, M. Ahn, H. Ponte, S. Gu, and V. Kumar, "Multi-agent manipulation via locomotion using hierarchical sim2real," arXiv preprint arXiv:1908.05224, 2019.

  16. [16]

    Driving policy transfer via modularity and abstraction

    M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun, "Driving policy transfer via modularity and abstraction," arXiv preprint arXiv:1804.09364, 2018.

  17. [17]

    GridToPix: Training embodied agents with minimal supervision

    U. Jain, I.-J. Liu, S. Lazebnik, A. Kembhavi, L. Weihs, and A. G. Schwing, "GridToPix: Training embodied agents with minimal supervision," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15141–15151.

  18. [18]

    Reinforcement learning with multi-fidelity simulators

    M. Cutler, T. J. Walsh, and J. P. How, "Reinforcement learning with multi-fidelity simulators," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 3888–3895.

  19. [19]

    System identification—A survey

    K. J. Åström and P. Eykhoff, "System identification—A survey," Automatica, vol. 7, no. 2, pp. 123–162, Mar. 1971. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0005109871900598

  20. [20]

    On finding 'exciting' trajectories for identification experiments involving systems with non-linear dynamics

    B. Armstrong, "On finding 'exciting' trajectories for identification experiments involving systems with non-linear dynamics," in 1987 IEEE International Conference on Robotics and Automation Proceedings, vol. 4, Mar. 1987, pp. 1131–1139. [Online]. Available: https://ieeexplore.ieee.org/document/1087968

  21. [21]

    Using simulation and domain adaptation to improve efficiency of deep robotic grasping

    K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., "Using simulation and domain adaptation to improve efficiency of deep robotic grasping," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 4243–4250.

  22. [22]

    Sim-to-real transfer with neural-augmented robot simulation

    F. Golemo, A. A. Taiga, A. Courville, and P.-Y. Oudeyer, "Sim-to-real transfer with neural-augmented robot simulation," in Conference on Robot Learning. PMLR, 2018, pp. 817–828.

  23. [23]

    Grounded action transformation for sim-to-real reinforcement learning

    J. P. Hanna, S. Desai, H. Karnan, G. Warnell, and P. Stone, "Grounded action transformation for sim-to-real reinforcement learning," Machine Learning, vol. 110, no. 9, pp. 2469–2499, 2021.

  24. [24]

    Reinforced Grounded Action Transformation for Sim-to-Real Transfer

    H. Karnan, S. Desai, J. P. Hanna, G. Warnell, and P. Stone, "Reinforced Grounded Action Transformation for Sim-to-Real Transfer," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Las Vegas, NV, USA: IEEE, Oct. 2020, pp. 4397–4402. [Online]. Available: https://ieeexplore.ieee.org/document/9341149/

  25. [25]

    A Theory of Abstraction in Reinforcement Learning

    D. Abel, "A Theory of Abstraction in Reinforcement Learning," Mar. 2022, arXiv:2203.00397 [cs]. [Online]. Available: http://arxiv.org/abs/2203.00397

  26. [26]

    Abstraction Selection in Model-based Reinforcement Learning

    N. Jiang, A. Kulesza, and S. Singh, "Abstraction Selection in Model-based Reinforcement Learning," in Proceedings of the 32nd International Conference on Machine Learning. PMLR, Jun. 2015, pp. 179–188. [Online]. Available: https://proceedings.mlr.press/v37/jiang15.html

  27. [27]

    Abstract Reward Processes: Leveraging State Abstraction for Consistent Off-Policy Evaluation

    S. Chaudhari, A. Deshpande, B. C. d. Silva, and P. S. Thomas, "Abstract Reward Processes: Leveraging State Abstraction for Consistent Off-Policy Evaluation," Oct. 2024, arXiv:2410.02172 [cs]. [Online]. Available: http://arxiv.org/abs/2410.02172

  28. [28]

    Learning Markov State Abstractions for Deep Reinforcement Learning

    C. Allen, N. Parikh, O. Gottesman, and G. Konidaris, "Learning Markov State Abstractions for Deep Reinforcement Learning," 2021.

  29. [29]

    Predictive representations of state

    M. Littman and R. S. Sutton, "Predictive representations of state," Advances in Neural Information Processing Systems, vol. 14, 2001.

  30. [30]

    Deep recurrent Q-learning for partially observable MDPs

    M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," in 2015 AAAI Fall Symposium Series, 2015.

  31. [31]

    Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

    J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, "Sim-to-Real: Learning Agile Locomotion For Quadruped Robots," in Proceedings of Robotics: Science and Systems, 2018.

  32. [32]

    Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

    [Online]. Available: http://arxiv.org/abs/1804.10332

  33. [33]

    Approximate information state for approximate planning and reinforcement learning in partially observed systems

    J. Subramanian, A. Sinha, R. Seraj, and A. Mahajan, "Approximate information state for approximate planning and reinforcement learning in partially observed systems," Journal of Machine Learning Research, vol. 23, no. 12, pp. 1–83, 2022.

  34. [34]

    On learning history-based policies for controlling Markov decision processes

    G. Patil, A. Mahajan, and D. Precup, "On learning history-based policies for controlling Markov decision processes," in International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 3511–3519.

  35. [35]

    BYOL-Explore: Exploration by Bootstrapped Prediction

    Z. D. Guo, S. Thakoor, M. Pîslar, B. A. Pires, F. Altché, C. Tallec, A. Saade, D. Calandriello, J.-B. Grill, Y. Tang, M. Valko, R. Munos, M. G. Azar, and B. Piot, "BYOL-Explore: Exploration by Bootstrapped Prediction," Jun. 2022, arXiv:2206.08332 [cs, stat]. [Online]. Available: http://arxiv.org/abs/2206.08332

  36. [36]

    Data-Efficient Reinforcement Learning with Self-Predictive Representations

    M. Schwarzer, A. Anand, R. Goel, R. D. Hjelm, A. Courville, and P. Bachman, "Data-Efficient Reinforcement Learning with Self-Predictive Representations," May 2021, arXiv:2007.05929 [cs, stat]. [Online]. Available: http://arxiv.org/abs/2007.05929

  37. [37]

    RMA: Rapid motor adaptation for legged robots

    A. Kumar, Z. Fu, D. Pathak, and J. Malik, "RMA: Rapid motor adaptation for legged robots," arXiv preprint arXiv:2107.04034, 2021.

  38. [38]

    Offline Reinforcement Learning with Implicit Q-Learning

    I. Kostrikov, A. Nair, and S. Levine, "Offline reinforcement learning with implicit Q-learning," arXiv preprint arXiv:2110.06169, 2021.

  39. [39]

    D4RL: Datasets for deep data-driven reinforcement learning

    J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine, "D4RL: Datasets for deep data-driven reinforcement learning," arXiv preprint, 2020.

  40. [40]

    Benchmarking deep reinforcement learning for continuous control

    Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in International Conference on Machine Learning. PMLR, 2016.

  41. [41]

    Real-world humanoid locomotion with reinforcement learning

    I. Radosavovic, T. Xiao, B. Zhang, T. Darrell, J. Malik, and K. Sreenath, "Real-world humanoid locomotion with reinforcement learning," Science Robotics, vol. 9, no. 89, p. eadi9579, 2024.