pith. machine review for the scientific record.

arxiv: 2604.22911 · v1 · submitted 2026-04-24 · 💻 cs.RO

Recognition: unknown

RecoverFormer: End-to-End Contact-Aware Recovery for Humanoid Robots

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:18 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid recovery · end-to-end policy · contact affordance · latent recovery modes · zero-shot generalization · transformer control · MuJoCo simulation

The pith

A single end-to-end policy delivers multi-modal contact-aware recovery for humanoid robots and generalizes zero-shot across perturbation magnitudes, contact geometries, and dynamics shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RecoverFormer as a transformer-based policy that learns to recover humanoid robots from pushes by switching among compensatory stepping, hand contact with surfaces, and center-of-mass adjustments. It processes a 50-step history of observations through a causal transformer and adds a latent mode head for smooth strategy transitions plus a contact affordance head that identifies useful environmental surfaces such as walls. Trained only in open-floor simulation, the policy transfers without retraining to walled settings and altered physics including added mass, friction changes, and latency. A reader would care because the result points to one learned model handling behaviors that previously required separate controllers or explicit mode labels. If correct, this reduces the engineering needed to make humanoids stable in varied real environments.
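The two output heads described here can be sketched in a few lines of numpy (an illustrative reconstruction, not the authors' code; the embedding width `D_MODEL`, the contact-region count `K_C`, and all weight values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64   # transformer embedding width (assumed)
K_MODES = 4    # latent recovery modes, K = 4 per the paper
K_C = 8        # number of contact regions (assumed)
N_JOINTS = 29  # Unitree G1 joint targets, 29-DoF per the paper

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random weights stand in for the trained heads.
W_mode = rng.standard_normal((D_MODEL, K_MODES)) * 0.1
W_aff = rng.standard_normal((D_MODEL, K_C)) * 0.1
W_act = rng.standard_normal((D_MODEL + K_MODES + K_C, N_JOINTS)) * 0.1

def heads(e_t):
    """e_t: transformer encoding of the 50-step observation history."""
    z_t = softmax(e_t @ W_mode)   # soft mode assignment -> smooth transitions
    c_t = sigmoid(e_t @ W_aff)    # per-surface affordance scores in [0, 1]
    a_t = np.tanh(np.concatenate([e_t, z_t, c_t]) @ W_act)  # joint targets
    return z_t, c_t, a_t

z, c, a = heads(rng.standard_normal(D_MODEL))
assert abs(z.sum() - 1.0) < 1e-9  # mode probabilities sum to one
```

The softmax over modes is what would permit smooth blending between strategies rather than hard switching, and the sigmoid keeps each affordance score bounded in [0, 1], matching the shapes described in the paper's Figure 2.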

Core claim

RecoverFormer is a fully end-to-end humanoid recovery policy that learns when and how to switch among recovery behaviors including compensatory stepping, hand-environment contact, and center-of-mass reshaping while maintaining robust performance under model mismatch. The architecture combines a causal transformer over a 50-step observation history with a latent recovery mode that enables smooth transitions among distinct recovery strategies and a contact affordance head that predicts which environmental surfaces are beneficial for stabilization. Trained only on open floor, RecoverFormer transfers zero-shot to walled environments, achieving 100 percent recovery success across 100-300 N pushes and wall distances from 0.25 m to 1.4 m.

What carries the argument

Causal transformer over 50-step observation history with latent recovery mode head for strategy transitions and contact affordance head for predicting useful surfaces.
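The causal constraint on that 50-step history can be illustrated with single-head masked self-attention (a minimal numpy sketch; the head count, width, and weights here are placeholders, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 50, 32  # 50-step observation history; embedding width is a placeholder

def causal_self_attention(X):
    """Single-head masked self-attention: step t attends only to steps <= t."""
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(D)                     # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # block attention to the future
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                # row-wise softmax
    return A @ V, A

X = rng.standard_normal((T, D))  # stand-in for the embedded observation history
out, A = causal_self_attention(X)
assert np.allclose(np.triu(A, k=1), 0.0)  # no weight falls on future steps
```

The upper-triangular mask is what makes the encoding usable for online control: the representation at the current step never depends on observations that have not arrived yet.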

If this is right

  • The policy achieves 100 percent success on 100-300 N pushes when walls are present at distances from 0.25 m to 1.4 m.
  • Under +25 percent mass it reaches 75.5 percent success, 89 percent under 30 ms added latency, 91.5 percent at low friction, and 99 percent under combined perturbations.
  • Latent recovery modes emerge that specialize across force regimes without any mode-level supervision.
  • The same policy maintains performance when contact geometry changes at test time.
  • Recovery behaviors remain robust when both contact geometry and dynamics parameters shift simultaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the simulation matches real contact physics, the policy could be deployed directly onto physical humanoids to handle unexpected disturbances in homes or warehouses without additional tuning.
  • The latent-mode and affordance design may generalize to other high-degree-of-freedom robots that must choose among locomotion, manipulation, and balance actions.
  • Combining history-based transformers with affordance prediction could reduce reliance on separate perception pipelines for contact planning.
  • The observed zero-shot transfer suggests testing whether similar architectures scale to continuous locomotion tasks that interleave recovery with navigation.

Load-bearing premise

The MuJoCo simulator accurately reproduces the contact forces, friction, and latency that govern real-robot behavior.

What would settle it

Running the trained policy on a physical Unitree G1 humanoid under matching push magnitudes, wall distances, and dynamics perturbations to measure actual recovery success.
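The dynamics-mismatch conditions the paper tests (added mass, reduced friction, actuation latency) can be expressed as a small perturbation harness (a schematic sketch; the `Dynamics` fields, nominal values, and wrapper names are invented for illustration, not taken from the paper):

```python
from collections import deque
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Dynamics:
    torso_mass_kg: float = 17.0   # placeholder nominal value
    friction_coeff: float = 1.0   # placeholder nominal value
    latency_steps: int = 0        # control steps of added actuation delay

def perturb(nominal: Dynamics, *, mass_scale=1.0, friction_scale=1.0,
            latency_ms=0.0, control_hz=50) -> Dynamics:
    """Build one zero-shot test condition from the nominal dynamics."""
    return replace(
        nominal,
        torso_mass_kg=nominal.torso_mass_kg * mass_scale,
        friction_coeff=nominal.friction_coeff * friction_scale,
        latency_steps=round(latency_ms * control_hz / 1000.0),
    )

class DelayedActions:
    """Replays actions `latency_steps` late, modelling actuation latency."""
    def __init__(self, latency_steps, neutral_action):
        self.buf = deque([neutral_action] * latency_steps)

    def __call__(self, action):
        self.buf.append(action)
        return self.buf.popleft()

# One of the paper's conditions: +25% mass combined with 30 ms latency.
cond = perturb(Dynamics(), mass_scale=1.25, latency_ms=30.0)
delay = DelayedActions(cond.latency_steps, neutral_action=0.0)
```

At the paper's 50 Hz control rate, 30 ms of latency falls between one and two control steps, which is why latency is modeled here as a delayed-action queue rather than a continuous time shift.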

Figures

Figures reproduced from arXiv: 2604.22911 by Zihui Liu.

Figure 1
Figure 1. RECOVERFORMER maintains balance through a lateral push on the Unitree G1. Mid-recovery snapshot 160 ms after a 180 N torso impulse. The transformer policy converts a 50-step history of proprioception and contact-region distances into 29-DoF joint targets at 50 Hz. view at source ↗
Figure 2
Figure 2. RECOVERFORMER architecture. A causal transformer encodes the 50-step observation history into e_t. The latent mode head predicts z_t over K = 4 modes; the affordance head predicts c_t ∈ [0, 1]^{K_c}. Both feed the action decoder. X is processed by L = 4 identical causal transformer blocks, each implementing masked multi-head self-attention (MHSA) with H_h = 4 heads followed by a feed-forward sublayer. view at source ↗
Figure 3
Figure 3. Representative open-floor recovery rollout. Frames: (0) pre-push stance; (1) impact with F applied for 100 ms; (2) compensatory whole-body motion; (3) stabilizing and steady-state recovery phase. Panels: (a) torso tilt under push perturbation (fall threshold 45°); (b) base height (nominal 0.78 m); force levels 100-250 N. view at source ↗
Figure 4
Figure 4. Representative balance trajectories on open floor. (a) Torso tilt and (b) base height versus time during a single push episode at four force levels. The red band marks the 100 ms push interval at t = 1.0 s. The tilt envelope decays exponentially with a damping ratio that scales inversely with force; the base height dip grows monotonically from 1.5 cm at 100 N to 6.5 cm at 250 N and recovers within ∼1 s. view at source ↗
read the original abstract

Humanoid robots operating in unstructured environments must recover from unexpected disturbances, a capability that remains challenging for end-to-end control policies. We present RECOVERFORMER, a fully end-to-end humanoid recovery policy that learns when and how to switch among recovery behaviors, including compensatory stepping, hand-environment contact, and center-of-mass reshaping, while maintaining robust performance under model mismatch. The architecture combines a causal transformer over a 50-step observation history with two novel heads: a latent recovery mode that enables smooth transitions among distinct recovery strategies, and a contact affordance head that predicts which environmental surfaces (walls, railings, table edges) are beneficial for stabilization. We evaluate RECOVERFORMER on the Unitree G1 humanoid in MuJoCo. Trained only on open floor, RECOVERFORMER transfers zero-shot to walled environments, achieving 100% recovery success across 100-300 N pushes and across wall distances from 0.25-1.4 m. Under zero-shot dynamics mismatch, RECOVERFORMER reaches 75.5% at +25% mass, 89% under 30 ms latency, 91.5% at low friction, and 99% under compound friction, latency, and mass perturbation. The learned latent modes specialize across force regimes without mode-level supervision, validated by t-SNE analysis of 300 episodes. Taken together, these results show that a single end-to-end policy can deliver multi-modal, contact-aware humanoid recovery that generalizes across perturbation magnitude, contact geometry, and dynamics shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. RecoverFormer is an end-to-end policy for humanoid recovery using a causal transformer with a latent recovery mode and contact affordance head. Trained on open-floor MuJoCo simulations of the Unitree G1, it claims zero-shot generalization to walled environments (100% success for 100-300 N pushes, 0.25-1.4 m walls) and dynamics mismatches (75.5% for +25% mass, 89% for 30 ms latency, etc.), with t-SNE showing mode specialization without supervision.

Significance. This approach could significantly advance end-to-end learning for robust humanoid control in unstructured environments by enabling multi-modal contact-aware recovery without hand-crafted behaviors. The zero-shot transfer and learned specialization are strengths if they hold beyond simulation. The work provides concrete quantitative results on generalization across perturbation types.

major comments (2)
  1. Abstract: The abstract reports specific success rates (e.g., 100% on walled environments, 75.5% under +25% mass) but does not mention baseline comparisons, ablation studies, training curves, or statistical details such as number of trials or variance, which are essential to evaluate whether the architecture drives the claimed generalization.
  2. Evaluation section: All quantitative results, including zero-shot transfer and dynamics robustness, are obtained exclusively in MuJoCo simulation. The central claim of applicability to humanoid robots in unstructured environments relies on the unverified assumption that MuJoCo accurately models contact forces, friction, and latency; no physical experiments on the Unitree G1 are reported to support sim-to-real transfer.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their insightful comments on our work. We address each major comment in detail below, indicating the revisions we intend to make to the manuscript.

read point-by-point responses
  1. Referee: Abstract: The abstract reports specific success rates (e.g., 100% on walled environments, 75.5% under +25% mass) but does not mention baseline comparisons, ablation studies, training curves, or statistical details such as number of trials or variance, which are essential to evaluate whether the architecture drives the claimed generalization.

    Authors: We agree that the abstract would benefit from additional context regarding the evaluation methodology. In the revised version, we will update the abstract to include a brief mention of the baseline comparisons performed and indicate that success rates are computed over 100 trials per condition, with full statistical details (including variance) reported in the evaluation section. Ablation studies and training curves are presented in the main paper and will be referenced in the abstract where space permits. revision: yes

  2. Referee: Evaluation section: All quantitative results, including zero-shot transfer and dynamics robustness, are obtained exclusively in MuJoCo simulation. The central claim of applicability to humanoid robots in unstructured environments relies on the unverified assumption that MuJoCo accurately models contact forces, friction, and latency; no physical experiments on the Unitree G1 are reported to support sim-to-real transfer.

    Authors: We acknowledge that the evaluations are simulation-only, as described in the manuscript. Our robustness experiments test the policy under varied dynamics parameters to simulate real-world mismatches. We will revise the manuscript to include an expanded limitations paragraph discussing the fidelity of MuJoCo for contact modeling and our plans for future real-robot validation. However, physical experiments on the Unitree G1 are not included in this work. revision: partial

standing simulated objections not resolved
  • The lack of physical experiments on the Unitree G1 to validate sim-to-real transfer.

Circularity Check

0 steps flagged

No circularity; empirical RL results independent of inputs

full rationale

The paper trains a causal-transformer policy end-to-end on a standard RL objective using only open-floor MuJoCo data. All quantitative claims (success rates under pushes, wall distances, mass/latency/friction shifts, t-SNE mode clustering) are obtained from subsequent simulation rollouts. No algebraic derivation, parameter fit renamed as prediction, self-citation chain, or ansatz is invoked to produce the reported metrics; the generalization numbers are direct experimental outcomes rather than tautological restatements of the training setup.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the central claim rests on standard reinforcement-learning assumptions plus two new architectural components whose training details are not supplied.

free parameters (1)
  • neural network weights and hyperparameters
    The transformer, mode head, and affordance head contain thousands of learned parameters whose values are determined by training rather than derived from first principles.
axioms (2)
  • domain assumption MuJoCo contact and dynamics model is sufficiently accurate for zero-shot generalization claims
    All reported success rates are obtained inside simulation; transfer to hardware is not demonstrated.
  • standard math The MDP formulation with 50-step history is Markovian enough for stable recovery learning
    Standard assumption in the transformer-RL setup described.
invented entities (2)
  • latent recovery mode no independent evidence
    purpose: Enables smooth unsupervised transitions among compensatory stepping, hand contact, and CoM reshaping strategies
    Introduced as a novel head whose specialization is validated post-hoc by t-SNE; no independent physical evidence is provided.
  • contact affordance head no independent evidence
    purpose: Predicts beneficial environmental surfaces for stabilization
    New output head whose predictions are used at inference time; no external validation of affordance accuracy is given.

pith-pipeline@v0.9.0 · 5571 in / 1670 out tokens · 62576 ms · 2026-05-08T11:18:51.699138+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    Real-time stabilization of a falling humanoid robot using hand contact: An optimal control approach,

    S. Wang and K. Hauser, “Real-time stabilization of a falling humanoid robot using hand contact: An optimal control approach,” in IEEE-RAS 17th International Conference on Humanoid Robots (Humanoids). IEEE, 2017, pp. 454–460

  2. [2]

    Development of push-recovery control system for humanoid robots using deep reinforcement learning,

    E. Aslan, M. A. Arserim, and A. Uçar, “Development of push-recovery control system for humanoid robots using deep reinforcement learning,” Ain Shams Engineering Journal, 2023

  3. [3]

    Realization of a real-time optimal control strategy to stabilize a falling humanoid robot with hand contact,

    S. Wang and K. Hauser, “Realization of a real-time optimal control strategy to stabilize a falling humanoid robot with hand contact,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018

  4. [4]

    SafeFall: Learning protective control for humanoid robots,

    Z. Meng, T. Liu, L. Ma, Y. Wu, R. Song, W. Zhang, and S. Huang, “SafeFall: Learning protective control for humanoid robots,” arXiv preprint arXiv:2511.18509, 2025

  5. [5]

    Unified multi-contact fall mitigation planning for humanoids via contact transition tree optimization,

    S. Wang and K. Hauser, “Unified multi-contact fall mitigation planning for humanoids via contact transition tree optimization,” in IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids). IEEE, 2018, pp. 1–9

  6. [6]

    FRASA: An end-to-end reinforcement learning agent for fall recovery and stand up of humanoid robots,

    C. Gaspard, M. Duclusaud, G. Passault, M. Daniel, and O. Ly, “FRASA: An end-to-end reinforcement learning agent for fall recovery and stand up of humanoid robots,” in IEEE International Conference on Robotics and Automation (ICRA), 2025

  7. [7]

    Unified humanoid fall-safety policy from a few demonstrations,

    Z. Xu, Y. Li, K.-y. Lin, and S. X. Yu, “Unified humanoid fall-safety policy from a few demonstrations,” arXiv preprint arXiv:2511.07407, 2025

  8. [8]

    Towards a maximally-robust self-balancing bicycle without reaction-moment gyroscopes or reaction wheels,

    A. M. Sharma, S. Wang, Y.-M. Zhou, and A. Ruina, “Towards a maximally-robust self-balancing bicycle without reaction-moment gyroscopes or reaction wheels,” in Bicycle and Motorcycle Dynamics, 2016

  9. [9]

    Learning getting-up policies for real-world humanoid robots,

    X. He, R. Dong, Z. Chen, and S. Gupta, “Learning getting-up policies for real-world humanoid robots,” arXiv preprint arXiv:2502.12152, 2025

  10. [10]

    HoST: Learning humanoid standing-up control across diverse postures,

    T. Huang, J. Ren, H. Wang, Z. Wang, Q. Ben, M. Wen, X. Chen, J. Li, and J. Pang, “HoST: Learning humanoid standing-up control across diverse postures,” arXiv preprint arXiv:2502.08378, 2025

  11. [11]

    Efficient online calibration for autonomous vehicle’s longitudinal dynamical system: A Gaussian model approach,

    S. Wang, C. Deng, and Q. Qi, “Efficient online calibration for autonomous vehicle’s longitudinal dynamical system: A Gaussian model approach,” in Proceedings of the Conference, 2023

  12. [12]

    Real-world humanoid locomotion with reinforcement learning,

    I. Radosavovic, T. Xiao, B. Zhang, T. Darrell, J. Malik, and K. Sreenath, “Real-world humanoid locomotion with reinforcement learning,” Science Robotics, vol. 9, 2024

  13. [13]

    Humanoid locomotion as next token prediction,

    I. Radosavovic, B. Zhang, B. Shi, J. Rajasegaran, S. Kamat, T. Darrell, K. Sreenath, and J. Malik, “Humanoid locomotion as next token prediction,” in Advances in Neural Information Processing Systems (NeurIPS), 2024

  14. [14]

    ExBody: Expressive whole-body control for humanoid robots,

    X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang, “ExBody: Expressive whole-body control for humanoid robots,” in Robotics: Science and Systems (RSS), 2024

  15. [15]

    ExBody2: Advanced expressive humanoid whole-body control,

    M. Ji, X. Peng, F. Liu, J. Li, G. Yang, X. Cheng, and X. Wang, “ExBody2: Advanced expressive humanoid whole-body control,” arXiv preprint arXiv:2412.13196, 2024

  16. [16]

    WoCoCo: Learning whole-body humanoid control with sequential contacts,

    C. Zhang, W. Xiao, T. He, and G. Shi, “WoCoCo: Learning whole-body humanoid control with sequential contacts,” in Conference on Robot Learning (CoRL), 2024

  17. [17]

    SENTINEL: A fully end-to-end language-action model for humanoid whole body control,

    Y. Wang, H. Jiang, S. Yao, Z. Ding, and Z. Lu, “SENTINEL: A fully end-to-end language-action model for humanoid whole body control,” arXiv preprint arXiv:2511.19236, 2025

  18. [18]

    LangWBC: Language-directed humanoid whole-body control via end-to-end learning,

    Y. Shao, B. Zhang, Q. Liao, X. Huang, Y. Gao, Y. Chi, Z. Li, S. Shao, and K. Sreenath, “LangWBC: Language-directed humanoid whole-body control via end-to-end learning,” in Robotics: Science and Systems (RSS), 2025

  19. [19]

    HOVER: Versatile neural whole-body controller for humanoid robots,

    T. He, W. Xiao, T. Lin, Z. Luo, Z. Xu, Z. Jiang, J. Kautz, C. Liu, G. Shi, X. Wang, L. Fan, and Y. Zhu, “HOVER: Versatile neural whole-body controller for humanoid robots,” arXiv preprint arXiv:2410.21229, 2024

  20. [20]

    LeVERB: Humanoid whole-body control with latent vision-language instruction,

    H. Xue, X. Huang, D. Niu, Q. Liao, T. Kragerud, J. T. Gravdahl, X. B. Peng, G. Shi, T. Darrell, K. Sreenath, and S. Sastry, “LeVERB: Humanoid whole-body control with latent vision-language instruction,” arXiv preprint arXiv:2506.13751, 2025

  21. [21]

    ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills,

    T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbabu, C. Pan et al., “ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills,” in Robotics: Science and Systems (RSS), 2025

  22. [22]

    KungfuBot: Physics-based humanoid whole-body control for learning highly-dynamic skills,

    W. Xie, J. Han, J. Zheng et al., “KungfuBot: Physics-based humanoid whole-body control for learning highly-dynamic skills,” in Advances in Neural Information Processing Systems (NeurIPS), 2025

  23. [23]

    Learning contact-rich whole-body manipulation with example-guided reinforcement learning,

    J. A. Barreiros, A. O. Önol, M. Zhang, S. Creasey, A. Goncalves, A. Beaulieu, A. Bhat, K. M. Tsui, and A. Alspach, “Learning contact-rich whole-body manipulation with example-guided reinforcement learning,” Science Robotics, vol. 10, p. eads6790, 2025

  24. [24]

    TACT: Humanoid whole-body contact manipulation through deep imitation learning with tactile modality,

    M. Murooka, T. Hoshi, K. Fukumitsu, S. Masuda, M. Hamze, T. Sasaki, M. Morisawa, and E. Yoshida, “TACT: Humanoid whole-body contact manipulation through deep imitation learning with tactile modality,” IEEE Robotics and Automation Letters (RA-L), vol. 10, no. 8, pp. 7819–7826, 2025

  25. [25]

    RT-Affordance: Affordances are versatile intermediate representations for robot manipulation,

    S. Nasiriany, S. Kirmani, T. Ding, L. Smith, Y. Zhu, D. Driess, D. Sadigh, and T. Xiao, “RT-Affordance: Affordances are versatile intermediate representations for robot manipulation,” arXiv preprint arXiv:2411.02704, 2024

  26. [26]

    A0: An affordance-aware hierarchical model for general robotic manipulation,

    R. Xu, J. Zhang, M. Guo et al., “A0: An affordance-aware hierarchical model for general robotic manipulation,” in International Conference on Computer Vision (ICCV), 2025

  27. [27]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black et al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024

  28. [28]

    Helix: A vision-language-action model for generalist humanoid control,

    Figure AI, “Helix: A vision-language-action model for generalist humanoid control,” https://www.figure.ai/news/helix, 2025

  29. [29]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Castañeda, N. Cherniadev et al., “GR00T N1: An open foundation model for generalist humanoid robots,” arXiv preprint arXiv:2503.14734, 2025

  30. [30]

    WholeBodyVLA: Towards unified latent VLA for whole-body loco-manipulation control,

    OpenDriveLab et al., “WholeBodyVLA: Towards unified latent VLA for whole-body loco-manipulation control,” in International Conference on Learning Representations (ICLR), 2026

  31. [31]

    RMA: Rapid motor adaptation for legged robots,

    A. Kumar, Z. Fu, D. Pathak, and J. Malik, “RMA: Rapid motor adaptation for legged robots,” in Robotics: Science and Systems (RSS), 2021

  32. [32]

    World model implanting for test-time adaptation of embodied agents,

    M. Yoo, S. Shin, D. Sub, and D. Lee, “World model implanting for test-time adaptation of embodied agents,” in International Conference on Machine Learning (ICML), 2025

  33. [33]

    TARC: Time-adaptive robotic control,

    A. Sukhija, L. Treven, J. Cheng, F. Dörfler, S. Coros, and A. Krause, “TARC: Time-adaptive robotic control,” arXiv preprint arXiv:2510.23176, 2025

  34. [34]

    MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents

    D. Jiang, Y. Li, G. Li, and B. Li, “MAGMA: A multi-graph based agentic memory architecture for AI agents,” arXiv preprint arXiv:2601.03236, 2026

  35. [35]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NeurIPS), 2017

  36. [36]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  37. [37]

    MuJoCo: A physics engine for model-based control,

    E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012