Active Embodiment Identification with Reinforcement Learning for Legged Robots

Jan Peters; Nico Bohlinger

arxiv: 2605.08020 · v1 · submitted 2026-05-08 · 💻 cs.RO

Pith reviewed 2026-05-11 02:28 UTC · model grok-4.3

keywords embodiment identification, reinforcement learning, legged robots, active learning, morphology prediction, simulation training, URMA architecture, joint parameter inference

The pith

Legged robots can learn to identify their own joint and global embodiment parameters by jointly training information-seeking actions and explicit predictions with reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that legged robots can discover details about their own physical structure and joints purely through active interaction, without any initial model or labels. It does so by training a reinforcement learning policy that simultaneously chooses actions to gather informative data and outputs predictions of embodiment parameters. The approach relies on a history-augmented architecture to process sequences of observations and actions across varied robot shapes in simulation. A sympathetic reader would care because this removes the need for manual specification of each robot's properties, allowing greater flexibility when designs change or robots encounter new conditions.

Core claim

The paper presents an active embodiment identification method for legged robots that jointly learns information-seeking behavior and explicit embodiment prediction. Using a history-augmented URMA architecture, the method infers joint-level and global embodiment parameters through interaction with the environment in simulation across different morphologies.

What carries the argument

The history-augmented URMA architecture, which processes sequences of past actions and observations to support simultaneous reinforcement learning of an information-seeking policy and explicit prediction of joint-level and global embodiment parameters.
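The source does not say how the interaction history is stored or passed to the network. As a minimal sketch, assuming a fixed-length sliding window over (observation, action) pairs that is zero-padded at the start of an episode, the history input could be assembled as below. `HistoryBuffer`, `horizon`, `obs_dim`, and `act_dim` are hypothetical names chosen for illustration, and the code does not reproduce URMA itself.

```python
from collections import deque


class HistoryBuffer:
    """Sliding window of (observation, action) pairs; an assumed design, not the paper's."""

    def __init__(self, horizon, obs_dim, act_dim):
        self.horizon = horizon
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        self.reset()

    def reset(self):
        # Zero-pad so the flattened feature vector has a constant length from step 0.
        pad = [0.0] * (self.obs_dim + self.act_dim)
        self.steps = deque((list(pad) for _ in range(self.horizon)), maxlen=self.horizon)

    def push(self, obs, action):
        if len(obs) != self.obs_dim or len(action) != self.act_dim:
            raise ValueError("observation or action has the wrong dimension")
        # The deque drops the oldest step automatically once the window is full.
        self.steps.append(list(obs) + list(action))

    def flat(self):
        # Oldest step first, newest step last.
        return [value for step in self.steps for value in step]


buf = HistoryBuffer(horizon=3, obs_dim=2, act_dim=1)
buf.push([0.1, 0.2], [1.0])
buf.push([0.3, 0.4], [-1.0])
features = buf.flat()  # 3 steps * (2 + 1) values = 9 entries
```

A vector like `features` would then feed both the policy and the embodiment predictor; whether the paper uses a flat window, a recurrent state, or attention over past steps is not stated in the abstract.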

If this is right

Robots can adapt controllers or behaviors to unknown or changing morphologies without external labels or prior models.
The joint training produces both effective exploration actions and usable explicit predictions of embodiment details.
The method generalizes across multiple legged robot designs when trained in simulation.
Information-seeking behavior emerges as a direct result of optimizing the combined prediction and control objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If transferred successfully to hardware, the approach could support self-calibration after damage or part replacement on physical robots.
Similar active identification could apply to non-legged robots or to predicting other properties such as sensor calibration.
The work implies that embodiment awareness need not be pre-programmed but can arise from interaction-driven learning.
Further tests in environments with external disturbances would clarify how robust the inferred parameters remain.

Load-bearing premise

That the history-augmented URMA architecture combined with reinforcement learning can reliably infer accurate embodiment parameters from simulated interactions without requiring explicit supervision or prior morphology knowledge.

What would settle it

A test on a held-out robot morphology where the trained policy produces embodiment parameter predictions with high error, or where random actions yield predictions as accurate as those obtained under the learned information-seeking policy.
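The comparison described above can be sketched as a simple check on per-episode prediction errors collected on a held-out morphology. This is an illustrative sketch only: `active_policy_helps` and the `min_relative_gain` threshold are hypothetical, and the error values in the example are placeholders, not results from the paper.

```python
from statistics import mean


def active_policy_helps(errors_learned, errors_random, min_relative_gain=0.1):
    """Return True if learned exploration lowers the mean prediction error
    by at least `min_relative_gain` relative to random actions."""
    mean_learned = mean(errors_learned)
    mean_random = mean(errors_random)
    if mean_random == 0:
        # Random actions already identify the parameters perfectly.
        return False
    relative_gain = (mean_random - mean_learned) / mean_random
    return relative_gain >= min_relative_gain


# Placeholder per-episode errors on one held-out morphology (made-up values).
learned = [0.10, 0.12, 0.08]
random_actions = [0.20, 0.22, 0.18]
helps = active_policy_helps(learned, random_actions)
```

If the function returned `False` on held-out morphologies, that would count against the claim that the learned behavior is genuinely information-seeking. A real evaluation would also want a significance test over many episodes and seeds.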

Figures

Figures reproduced from arXiv: 2605.08020 by Jan Peters, Nico Bohlinger.

Figure 2: Embodiment identification errors for joint-level (top row) and general embodiment parameters (bottom row) across four legged robots.

Original abstract

We present an active embodiment identification method for legged robots that jointly learns information-seeking behavior and explicit embodiment prediction. Using a history-augmented URMA architecture, the method infers joint-level and global embodiment parameters through interaction with the environment in simulation across different morphologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a joint RL-plus-URMA method for legged robots to learn their own parameters through interaction, but the abstract supplies no results or comparisons so the actual payoff is still unknown.

The core idea is straightforward: train a policy that actively gathers information while a history-augmented URMA model predicts both joint-level and global embodiment parameters from the resulting trajectories. The claim is that this combination lets the robot identify its morphology without explicit labels or prior models, at least inside simulation and across a few different body shapes. That pairing of information-seeking behavior with explicit prediction is the part that feels new; most prior identification work either fixes the exploration strategy or treats prediction as a separate supervised step.

The paper does a clean job stating the problem and the architecture choice, and the self-supervised framing via interaction is a reasonable fit for the setting. It also keeps the target parameters clearly defined rather than folding them into some fitted loss, which avoids the circularity trap.

The main limitation is that nothing is shown yet. No error curves, no ablation on the history buffer or the reward for information gain, and no head-to-head numbers against simpler random-exploration baselines or existing system-ID methods. Everything stays in simulation, so questions about sensor noise, actuator limits, and sim-to-real gaps are left open. The scope is also narrow: legged platforms only, and the abstract does not indicate how far the learned policies generalize beyond the training morphologies.

This is the sort of paper that matters to people working on adaptive legged control or online calibration. A reader who needs concrete techniques for robots that must handle changing mass or joint wear could pick up useful implementation details once the experiments are filled in. It is coherent enough on its own terms to go to referees; the method sketch is internally consistent and the motivation is practical. I would send it for review rather than desk-reject, mainly so the experiments and any literature overlaps can be checked properly.

Referee Report

1 major / 2 minor

Summary. The paper presents an active embodiment identification method for legged robots that jointly learns information-seeking behavior and explicit embodiment prediction. It employs a history-augmented URMA architecture to infer joint-level and global embodiment parameters through simulated environmental interactions across varying morphologies, without explicit supervision or prior morphology knowledge.

Significance. If the empirical results support the claims, the work would advance self-supervised adaptation in robotics by enabling legged systems to discover their own physical parameters via active exploration. The joint optimization of exploration policy and parameter prediction via RL is a coherent framing for embodiment identification tasks, and the simulation-based evaluation across morphologies offers a reproducible testbed. Such methods could reduce reliance on manual calibration in real-world deployments.

major comments (1)

[Abstract and §4 (Experiments)] The abstract and method sketch claim reliable inference of embodiment parameters from interactions, yet no quantitative results, error metrics, ablation studies, or baseline comparisons are supplied to validate accuracy or the necessity of the history-augmentation and joint learning components. This directly undermines assessment of whether the URMA+RL approach achieves the stated goals.

minor comments (2)

[Abstract] The acronym URMA is introduced without expansion or reference to prior work; a brief definition or citation would improve clarity for readers unfamiliar with the architecture.
[Method] Notation for joint-level versus global parameters should be introduced explicitly with symbols to avoid ambiguity when describing the prediction targets.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential impact. We address the single major comment below.

Referee: [Abstract and §4 (Experiments)] The abstract and method sketch claim reliable inference of embodiment parameters from interactions, yet no quantitative results, error metrics, ablation studies, or baseline comparisons are supplied to validate accuracy or the necessity of the history-augmentation and joint learning components. This directly undermines assessment of whether the URMA+RL approach achieves the stated goals.

Authors: We agree that the current manuscript version does not provide quantitative error metrics, ablation studies, or baseline comparisons in §4 to substantiate the claims of reliable inference. We will revise the experiments section to include these: prediction error metrics (e.g., MSE for joint-level and global parameters across morphologies), ablations isolating the history-augmentation and joint RL components, and comparisons to baselines such as non-active or non-history-augmented variants. This will directly validate the approach.

Revision: yes
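The error metrics proposed in the rebuttal could be reported along the following lines. This is a hedged sketch, assuming each evaluation sample carries predicted and ground-truth values for a joint-level group and a global group; `error_report`, the dictionary keys, the morphology names, and all numbers are hypothetical and are not taken from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equally long sequences of floats."""
    if len(pred) != len(target) or not target:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def error_report(samples):
    """Average joint-level and global MSE per morphology."""
    collected = {}
    for s in samples:
        entry = collected.setdefault(s["morphology"], {"joint": [], "global": []})
        entry["joint"].append(mse(s["joint_pred"], s["joint_true"]))
        entry["global"].append(mse(s["global_pred"], s["global_true"]))
    return {
        name: {group: sum(vals) / len(vals) for group, vals in groups.items()}
        for name, groups in collected.items()
    }


# Made-up evaluation samples for two hypothetical morphologies.
samples = [
    {"morphology": "quadruped_a",
     "joint_pred": [1.0, 2.0], "joint_true": [1.0, 4.0],
     "global_pred": [3.0], "global_true": [2.0]},
    {"morphology": "biped_b",
     "joint_pred": [0.0, 0.0], "joint_true": [0.0, 0.0],
     "global_pred": [1.0, 1.0], "global_true": [2.0, 3.0]},
]
report = error_report(samples)
```

Keeping the joint-level and global groups separate mirrors the two rows of Figure 2. In practice, parameters with different units would likely need normalization before their errors are averaged together.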

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents a learning-based method that uses reinforcement learning together with a history-augmented URMA architecture to jointly acquire information-seeking policies and explicit embodiment-parameter predictors from simulated robot-environment interactions. No load-bearing step reduces by construction to a fitted input renamed as a prediction, a self-definitional equation, or a self-citation whose content is itself unverified. The central claim remains an empirical statement about what the combined RL-plus-neural architecture can achieve when trained on interaction data across morphologies; it does not presuppose the target parameters or the learned behavior inside its own definitions or loss functions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities can be identified; the description does not specify any fitted values, unproven assumptions, or new postulated components.
