arxiv: 2606.00702 · v1 · pith:G43KIO2D new · submitted 2026-05-30 · 💻 cs.RO · cs.AI

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

Pith reviewed 2026-06-28 18:39 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: multi-embodiment, robot design, value gradients, reinforcement learning, morphology optimization, differentiable surrogate, embodiment parameters

The pith

A value function trained across many robot designs can optimize the body of a new robot using its gradients alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train one policy and value function on a collection of different robot bodies. Once trained, the value function is frozen and used to guide changes to a robot's embodiment parameters by following the gradients of its value estimates. This avoids running a full reinforcement learning process for each new design. The method is evaluated both on perturbed versions of single robots and on held-out robots from other morphology classes, using single models trained on up to 50 robots and design spaces of over 1100 continuous embodiment parameters.

Core claim

Instead of co-designing policy and embodiment from scratch for each robot, a single embodiment-aware value function trained on many designs serves as a reusable, differentiable model that directly supplies gradients for improving new robot bodies.

What carries the argument

The frozen value function acting as a differentiable surrogate for embodiment optimization.
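
A minimal sketch of what "following value gradients" can look like, under stated assumptions rather than as the paper's implementation. The frozen critic is replaced by a hypothetical toy function `toy_value` with a hand-written gradient, and the soft trust region is assumed to be a quadratic penalty weighted by `lam`. The names `toy_value`, `toy_value_grad`, `design_ascent`, `lam`, and `step_size` are illustrative and do not come from the paper.

```python
from typing import Callable, List

Vector = List[float]

# Hypothetical optimum of the toy surrogate (an assumption for illustration).
TOY_TARGET = (1.0, -0.5)


def toy_value(f: Vector) -> float:
    """Toy stand-in for a frozen critic: highest at f == TOY_TARGET."""
    return -sum((x - t) ** 2 for x, t in zip(f, TOY_TARGET))


def toy_value_grad(f: Vector) -> Vector:
    """Analytic gradient of toy_value; a learned critic would use autodiff."""
    return [-2.0 * (x - t) for x, t in zip(f, TOY_TARGET)]


def design_ascent(
    f_init: Vector,
    f_ref: Vector,
    value_grad: Callable[[Vector], Vector],
    lam: float = 0.1,
    step_size: float = 0.05,
    iters: int = 50,
) -> Vector:
    """Gradient ascent on value(f) - lam * ||f - f_ref||^2 (assumed penalty form)."""
    f = list(f_init)
    for _ in range(iters):
        g_value = value_grad(f)
        # Gradient of the assumed quadratic trust-region penalty.
        g_trust = [-2.0 * lam * (x - r) for x, r in zip(f, f_ref)]
        f = [x + step_size * (gv + gt) for x, gv, gt in zip(f, g_value, g_trust)]
    return f


f_ref = [0.0, 0.0]    # nominal reference design (made-up numbers)
f_init = [0.3, 0.2]   # perturbed starting design (made-up numbers)
f_star = design_ascent(f_init, f_ref, toy_value_grad)
```

With `lam > 0` the result settles between the surrogate's optimum and the reference design, which is the intended effect of a soft anchor: the design can move, but large departures from `f_ref` are penalized.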

If this is right

  • Optimizing complete robot embodiments across held-out morphologies without retraining.
  • Identifying which design and control parameters most limit performance.
  • Scaling to design spaces with over 1100 continuous parameters using one model.
  • Analyzing new designs by highlighting performance bottlenecks via gradients.
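
One way to read "highlighting performance bottlenecks via gradients" is to aggregate the design gradient by parameter group and rank the groups. The sketch below assumes a flat gradient vector and a hypothetical mapping from group names to parameter indices; the group names and numbers are made up, and the RMS aggregation is an illustrative choice, not necessarily the one the paper uses.

```python
import math
from typing import Dict, List


def group_rms(grad: List[float], groups: Dict[str, List[int]]) -> Dict[str, float]:
    """Root-mean-square gradient magnitude per parameter group."""
    return {
        name: math.sqrt(sum(grad[i] ** 2 for i in idx) / len(idx))
        for name, idx in groups.items()
    }


def rank_groups(grad: List[float], groups: Dict[str, List[int]]) -> List[str]:
    """Group names sorted from largest to smallest RMS gradient."""
    rms = group_rms(grad, groups)
    return sorted(rms, key=lambda name: rms[name], reverse=True)


# Made-up gradient of the value estimate with respect to six design parameters.
grad = [0.02, -0.01, 0.40, -0.35, 0.05, 0.00]
groups = {
    "leg_lengths": [0, 1],
    "joint_gains": [2, 3],
    "foot_size": [4, 5],
}
ranking = rank_groups(grad, groups)
```

Gradient magnitudes depend on how each parameter is scaled, so a comparison like this is only meaningful if the parameters are normalized to comparable ranges first.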

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Value gradients could extend to physical robot hardware if the value model generalizes to real-world dynamics.
  • Similar surrogate approaches might apply to other co-design problems like vehicle or aircraft shaping.
  • The method assumes access to a diverse training set of robots, which may limit use for entirely novel morphology classes.

Load-bearing premise

A value function trained on a set of robot embodiments will produce useful gradients for optimizing new and potentially dissimilar robot shapes.

What would settle it

Optimize a held-out robot from a different morphology class using the gradients from a value function trained on 50 other robots and measure whether the resulting design performs better than the original.
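
The test described above reduces to comparing rollout returns before and after optimization. A minimal sketch, assuming the returns have already been measured in simulation for several starting designs; the numbers are placeholders, not results from the paper, and `return_improvements` is a hypothetical helper.

```python
from statistics import mean
from typing import List


def return_improvements(initial: List[float], optimized: List[float]) -> List[float]:
    """Per-start improvement: return of the found design minus return of the start."""
    if len(initial) != len(optimized):
        raise ValueError("need one optimized return per initial return")
    return [r_opt - r_init for r_init, r_opt in zip(initial, optimized)]


# Placeholder rollout returns for four starting designs of one held-out robot.
r_init = [10.0, 12.0, 9.0, 11.0]
r_found = [14.0, 13.0, 12.0, 15.0]

delta = return_improvements(r_init, r_found)
mean_delta = mean(delta)
fraction_improved = sum(d > 0 for d in delta) / len(delta)
```

A positive `mean_delta` measured from real rollouts, rather than from the critic's own predictions, is what would distinguish a genuine gain from an exploited surrogate artifact.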

Figures

Figures reproduced from arXiv: 2606.00702 by Jan Peters, Nico Bohlinger.

Figure 1
Figure 1: Shape Your Body. We first train an embodiment-aware policy and value function with multi-embodiment reinforcement learning, then we optimize new designs by differentiating through the value function and applying the gradients inside a soft trust region around a reference design.
Figure 2
Figure 2: Optimized robot designs. We visualize two initial designs and the corresponding optimized designs for the Go2 and the MIT Humanoid. We also show the reference design and an optimized design for the Fourier GR1 T2 humanoid. Although some assets appear stretched or visually disconnected, the underlying geometries remain connected and controllable by the policy. Full optimization trajectories across initial…
Figure 3
Figure 3: Single-robot design. Mean return improvement $\Delta R$ over the initial perturbed design $f_{\text{init}}$ for 10 starts per robot. The nominal URDF design $f_{\text{ref}}$ is shown as a reference, but is only used as the anchor for the trust region. This leads to the design gradient $g_n = \nabla_f \hat{J}_{\lambda,B}(f_n)$ at iteration $n$, which contains a value gradient term obtained by backpropagation through the frozen critic and the design map $\Phi$, and a s…
Figure 4
Figure 4: Comparison to RL-based co-design. The first three plots show the return over environment steps for Schaff2019 [21], FEACRL, Transform2Act, BodyGen, and Stackelberg PPO. The dashed line shows the final performance of VGDS after training a policy and critic on the full design space, and then designing from the same $f_{\text{init}}$ that the RL baselines started from. The fourth plot shows the cumulative time needed to c…
Figure 5
Figure 5: Effect of training sets. The target robot is either held out from a morphology class set (open circles) or included in the full 50-robot training set (filled circles). We omit error bars for readability. … provide insights into tuning control parameters. For the MIT Humanoid, the overall strongest changes are nominal joint positions and gains, together with reduced foot size. Copying only the optimized gains into t…
Figure 6
Figure 6: Overview of all 50 robots used in the multi-embodiment RL training [42]. Appendix A (Experimental Setup Details), A.1 Environment: All robots are simulated in a fully JAX-jitted MJX locomotion environment implemented in RL-X. The environment runs at a 200 Hz simulation frequency with a 50 Hz control frequency, and each episode lasts at most 20 s, corresponding to 1,000 control steps. Episodes terminate early i…
Figure 7
Figure 7: Overview of the direct-design URMA critic architecture. We simplify the visualization: the foot encoder corresponds to the joint encoder, but with foot observations and foot description vectors as input, and the value heads all use the same structure. The linear layers in the value heads are all wrapped in WeightNorm (WN) layers [56]. Appendix B: Direct-Design URMA Critic.
Figure 8
Figure 8: x-y tracking error reduction for single-robot design. Mean reduction $\Delta e_{xy}$ in absolute x-y linear velocity tracking error relative to the initial perturbed design $f_{\text{init}}$, for 10 starts per robot. Recoverable plot labels: panel title Unitree Go2; x-axis Tracking Error Reduction ($\Delta e_{xy}$, m/s); methods VGDS, GC-PFO, $f_{\text{ref}}$, ARS, CEM, CMA-ES, PSO, DE, BO, TuRBO, Random, $f_{\text{init}}$.
Figure 9
Figure 9: x-y tracking error reduction for multi-robot design. Mean reduction $\Delta e_{xy}$ in absolute x-y linear velocity tracking error when the target robot is either held out from its morphology class training set or included in the full 50-robot training set. We omit error bars for readability. Schaff2019. Schaff2019 maintains a distribution over designs and updates this distribution from rollout returns while training…
Figure 10
Figure 10: Design search on all 50 training robots. Per-robot return improvement $\Delta R = R_{\text{found}} - R_{\text{init}}$ of VGDS applied to each robot in the full 50-robot training set, starting from 10 uniformly sampled random initial designs per robot. … for the rollout of the execution policy. Both the transform and execution policies are trained with PPO. BodyGen. BodyGen builds on the structure of Transform2Act and improves the archi…
Figure 11
Figure 11: Grouped design changes. Signed RMS changes of the optimized designs relative to the nominal reference, grouped by body part and parameter type. The heatmaps show which design groups are changed most consistently by VGDS. … are the most useful isolated group. For Unitree Go2, the best result comes from combining the highlighted joint axis, foot geometry, and actuator velocity limit changes. For Golem, reduci…
Figure 12
Figure 12: Parameter group evaluations. Selected parameter groups from the optimized design $f^\star$ are copied into the initial design $f_{\text{init}}$ and evaluated with the same policy. … rollout return, with Spearman $\rho = 0.40$ overall and $\rho = 0.50$, $0.60$, and $0.83$ for MIT Humanoid, Unitree Go2, and Golem. The final design $f^\star$ has both the highest value prediction and the highest rollout return for all three robots. I Design Search …
Figure 13
Figure 13: Design search trajectories for the Unitree Go2 quadruped. The columns show 8 different initial designs, and the rows show the initial design and VGDS at iteration 10, 20, 30, 40, and 50.
Figure 14
Figure 14: Design search trajectories for the MIT Humanoid. The columns show 8 different initial designs, and the rows show the initial design and VGDS at iteration 10, 20, 30, 40, and 50.
Figure 15
Figure 15: Design search trajectories for the Golem hexapod. The columns show 8 different initial designs, and the rows show the initial design and VGDS at iteration 10, 20, 30, 40, and 50.
Original abstract

We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcement learning co-design loop for each robot, we first train an embodiment-aware policy and value function across many robot designs. After training, the frozen value function is used as a differentiable surrogate to optimize candidate embodiments through value gradients. We evaluate our approach across different robot design settings, from perturbed single robots to held-out robots across morphology classes, with single models trained on up to 50 robots and design spaces of over 1100 continuous embodiment parameters. Beyond optimizing complete embodiments, we show that value gradients can identify performance-limiting design and control parameters, enabling both the optimization and the analysis of new robot designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a multi-embodiment value function trained across many robot designs can be frozen after training and reused as a differentiable surrogate for optimizing new robot embodiments via value gradients, avoiding per-robot RL co-design loops. It reports evaluation from perturbed single-robot cases to held-out robots across morphology classes, using models trained on up to 50 robots with design spaces exceeding 1100 continuous parameters, and demonstrates the gradients' utility for both optimization and identifying performance-limiting design/control parameters.

Significance. If the central claim holds, the work offers a practical route to amortize the cost of multi-embodiment RL across design tasks, enabling faster iteration on robot morphologies and parameter analysis without retraining. The approach of treating a frozen value function as a reusable design surrogate is a concrete contribution to co-design methods.

major comments (2)
  1. [Abstract, §4] Abstract and §4 (evaluation protocol): the claim that value gradients optimize held-out morphologies across classes rests on the assumption that the learned value surface extrapolates reliably outside the training distribution of ≤50 robots; the reported results do not indicate whether optimized trajectories were validated against ground-truth returns obtained by re-training or rolling out the embodiment-specific policy, leaving open the possibility that gradient steps exploit value-function artifacts rather than true performance gains.
  2. [§3.2] §3.2 (value-gradient optimization): the method treats the frozen value function as a surrogate whose gradients are used directly for embodiment-parameter descent; no analysis is provided of gradient norm stability or convexity properties when the embodiment parameters lie far from the training morphologies, which is load-bearing for the held-out-robot claim.
minor comments (2)
  1. Notation for embodiment parameters and state-embodiment concatenation should be introduced once with a clear table or diagram rather than inline.
  2. The abstract states 'over 1100 continuous embodiment parameters' but does not clarify whether this is the total dimensionality or per-robot; a sentence in §2 or §4 would remove ambiguity.
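
The first major comment asks whether value predictions track ground-truth returns. One simple diagnostic is a rank correlation between the critic's predictions and measured rollout returns over a set of candidate designs. The sketch below implements Spearman's rank correlation with the standard library only; the input numbers are placeholders rather than results from the paper, and the helper names are hypothetical.

```python
import math
from statistics import mean
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder data: critic predictions vs. rollout returns for five designs.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
rollout = [10.0, 30.0, 20.0, 40.0, 50.0]
rho = spearman(predicted, rollout)
```

A rho near 1 means the critic orders designs the same way the simulator does; a low or negative value on held-out morphologies would support the referee's concern.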

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below and indicate the revisions we plan to make to the manuscript.

point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (evaluation protocol): the claim that value gradients optimize held-out morphologies across classes rests on the assumption that the learned value surface extrapolates reliably outside the training distribution of ≤50 robots; the reported results do not indicate whether optimized trajectories were validated against ground-truth returns obtained by re-training or rolling out the embodiment-specific policy, leaving open the possibility that gradient steps exploit value-function artifacts rather than true performance gains.

    Authors: We appreciate this observation. Our current evaluation protocol includes direct comparisons of the optimized designs against baseline methods using the multi-embodiment policy and value function, as well as some ground-truth evaluations on perturbed cases. However, we acknowledge that for the held-out optimized robots, we have not reported results from re-training embodiment-specific policies to obtain independent ground-truth returns. This is a valid concern regarding potential artifacts. In the revised version, we will include additional experiments where we re-train policies on a selection of the optimized held-out embodiments and compare the returns to validate the performance improvements. revision: yes

  2. Referee: [§3.2] §3.2 (value-gradient optimization): the method treats the frozen value function as a surrogate whose gradients are used directly for embodiment-parameter descent; no analysis is provided of gradient norm stability or convexity properties when the embodiment parameters lie far from the training morphologies, which is load-bearing for the held-out-robot claim.

    Authors: We agree that further analysis of gradient behavior would be beneficial. Our empirical results across held-out robots demonstrate successful optimization, suggesting practical stability in the tested regimes. However, we did not provide explicit analysis of gradient norms or convexity. In the revision, we will add plots and discussion of gradient norms during the optimization process for held-out cases to address stability. A full theoretical analysis of convexity is beyond the scope as the value function is a neural network, but we can discuss the empirical properties. revision: partial
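
The gradient-norm plots promised in the second response could be produced by logging the norm of the design gradient at every optimization step. A minimal sketch, assuming plain gradient ascent and a hypothetical toy gradient `toy_grad` in place of the frozen critic; it is not the paper's optimizer.

```python
import math
from typing import Callable, List

Vector = List[float]


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def trace_gradient_norms(
    f_init: Vector,
    grad_fn: Callable[[Vector], Vector],
    step_size: float,
    iters: int,
) -> List[float]:
    """Run gradient ascent and record the gradient norm before each step."""
    f = list(f_init)
    norms: List[float] = []
    for _ in range(iters):
        g = grad_fn(f)
        norms.append(l2_norm(g))
        f = [x + step_size * gi for x, gi in zip(f, g)]
    return norms


def toy_grad(f: Vector) -> Vector:
    """Gradient of the toy objective -||f||^2 (an assumption for illustration)."""
    return [-2.0 * x for x in f]


norms = trace_gradient_norms([3.0, 4.0], toy_grad, step_size=0.1, iters=20)
is_decaying = all(b <= a for a, b in zip(norms, norms[1:]))
```

On a well-behaved surrogate the trace shrinks as the design approaches a local optimum; growing or oscillating norms on held-out robots would be a warning sign worth plotting.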

Circularity Check

0 steps flagged

No circularity: standard train-freeze-optimize pipeline with independent held-out evaluation

full rationale

The paper trains an embodiment-aware policy and value function across a collection of robot designs, then freezes the value function to serve as a differentiable surrogate for gradient-based embodiment optimization on new or held-out designs. This workflow does not reduce any claimed result to its own fitted parameters by construction; the optimization step uses the learned model as an external surrogate rather than re-deriving quantities already implicit in the training data. Evaluation explicitly includes held-out robots across morphology classes, providing an independent test of generalization outside the training distribution. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the described chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5645 in / 960 out tokens · 22422 ms · 2026-06-28T18:39:57.277653+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 14 canonical work pages · 3 internal anchors

  1. [1]

    K. Sims. Evolving 3d morphology and behavior by competition. Artificial life, 1(4):353–372, 1994

  2. [2]

    H. Lipson and J. B. Pollack. Automatic design and manufacture of robotic lifeforms. Nature, 406(6799):974–978, 2000

  3. [3]

    K. S. Luck, H. B. Amor, and R. Calandra. Data-efficient co-adaptation of morphology and behaviour with deep reinforcement learning. In Conference on Robot Learning, pages 854–

  4. [4]

    Y. Yuan, Y. Song, Z. Luo, W. Sun, and K. Kitani. Transform2act: Learning a transform-and-control policy for efficient agent design. arXiv preprint arXiv:2110.03659, 2021

  5. [5]

    N. Bohlinger, G. Czechmanowski, M. Krupka, P. Kicki, K. Walas, J. Peters, and D. Tateo. One policy to run them all: an end-to-end learning approach to multi-embodiment locomotion. Conference on Robot Learning, 2024

  6. [6]

    B. Ai, L. Dai, N. Bohlinger, D. Li, T. Mu, Z. Wu, K. Fay, H. I. Christensen, J. Peters, and H. Su. Towards embodiment scaling laws in robot locomotion. Conference on Robot Learning (CoRL), 2025

  7. [7]

    R. C. Bertossa. Morphology and behaviour: functional links in development and evolution. Philosophical Transactions of the Royal Society B: Biological Sciences, 366(1574):2056–2068, 2011

  8. [8]

    G. S. Hornby and J. B. Pollack. Body-brain co-evolution using l-systems as a generative encoding. In Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation, pages 868–875, 2001

  9. [9]

    T. Wang, Y. Zhou, S. Fidler, and J. Ba. Neural graph evolution: Towards efficient automatic robot design. arXiv preprint arXiv:1906.05370, 2019

  10. [10]

    D. Banarse, Y. Bachrach, S. Liu, G. Lever, N. Heess, C. Fernando, P. Kohli, and T. Graepel. The body is not a given: Joint agent policy learning and morphology evolution. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 1134–1142. International Foundation for Autonomous Agents and Multiagent Systems, 2019

  11. [11]

    A. Zhao, J. Xu, M. Konaković-Luković, J. Hughes, A. Spielberg, D. Rus, and W. Matusik. Robogrammar: graph grammar for terrain-optimized robot design. ACM Transactions on Graphics (TOG), 39(6):1–16, 2020

  12. [12]

    J. Xu, A. Spielberg, A. Zhao, D. Rus, and W. Matusik. Multi-objective graph heuristic search for terrestrial robot design. In 2021 IEEE international conference on robotics and automation (ICRA), pages 9863–9869. IEEE, 2021

  13. [13]

    D. J. Hejna III, P. Abbeel, and L. Pinto. Task-agnostic morphology evolution. arXiv preprint arXiv:2102.13100, 2021

  14. [14]

    C. Schaff and M. R. Walter. N-limb: Neural limb optimization for efficient morphological design. arXiv preprint arXiv:2207.11773, 2022

  15. [15]

    K. Qiu, W. Pałucki, K. Ciebiera, P. Fijałkowski, M. Cygan, and Ł. Kuciński. Robomorph: Evolving robot morphology using large language models. arXiv preprint arXiv:2407.08626, 2024

  16. [16]

    C. Yu, W. Zhang, H. Lai, Z. Tian, L. Kneip, and J. Wang. Multi-embodiment legged robot control as a sequence modeling problem. arXiv preprint arXiv:2212.09078, 2022

  17. [17]

    J. Hu, J. Whitman, and H. Choset. Glso: Grammar-guided latent space optimization for sample-efficient robot design automation. In Conference on Robot Learning, pages 1321–

  18. [18]

    K. Ikemura, Y. Dong, and F. T. Pokorny. Latent diffeomorphic co-design of end-effectors for deformable and fragile object manipulation. arXiv preprint arXiv:2602.17921, 2026

  19. [19]

    A. Vaish and O. Brock. Identifying inductive biases for robot co-design. arXiv preprint arXiv:2604.11768, 2026

  20. [20]

    D. Ha. Reinforcement learning for improving agent design. Artificial life, 25(4):352–365, 2019

  21. [21]

    C. Schaff, D. Yunis, A. Chakrabarti, and M. R. Walter. Jointly learning to construct and control agents using deep reinforcement learning. In 2019 international conference on robotics and automation (ICRA), pages 9798–9805. IEEE, 2019

  22. [22]

    T. Chen, Z. He, and M. Ciocarlie. Hardware as policy: Mechanical and computational co-optimization using deep reinforcement learning. In Conference on Robot Learning, pages 1158–1173. PMLR, 2021

  23. [23]

    Y. Wang, S. Wu, H. Fu, Q. Fu, T. Zhang, Y. Chang, and X. Wang. Curriculum-based co-design of morphology and control of voxel-based soft robots. In The Eleventh International Conference on Learning Representations, 2023

  24. [24]

    H. Dong, J. Zhang, T. Wang, and C. Zhang. Symmetry-aware robot design with structured subgroups. In International Conference on Machine Learning, pages 8334–8355. PMLR, 2023

  25. [25]

    M. Li, D. Matthews, and S. Kriegman. Reinforcement learning for freeform robot design. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8799–8806. IEEE, 2024

  26. [26]

    H. Lu, Z. Wu, J. Xing, J. Li, R. Li, Z. Li, and Y. Shi. Bodygen: Advancing towards efficient embodiment co-design. In The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    Y. Dai, Y. Wang, D. R. Ashley, and J. Schmidhuber. Efficient morphology–control co-design via stackelberg PPO. In The Fourteenth International Conference on Learning Representations, 2026

  28. [28]

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

  29. [29]

    J. Kennedy and R. Eberhart. Particle swarm optimization. In Proceedings of ICNN’95 - international conference on neural networks, volume 4, pages 1942–1948. IEEE, 1995

  30. [30]

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  31. [31]

    T. Wang, R. Liao, J. Ba, and S. Fidler. Nervenet: Learning structured policy with graph neural networks. In International conference on learning representations, 2018

  32. [32]

    W. Huang, I. Mordatch, and D. Pathak. One policy to control them all: Shared modular policies for agent-agnostic control. In International Conference on Machine Learning, pages 4455–4464. PMLR, 2020

  33. [33]

    A. Gupta, L. Fan, S. Ganguli, and L. Fei-Fei. Metamorph: learning universal controllers with transformers. In International Conference on Learning Representations. ICLR, 2022

  34. [34]

    A. Patel and S. Song. Get-zero: Graph embodiment transformer for zero-shot embodiment generalization. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 14262–14269. IEEE, 2025

  35. [35]

    C. Sferrazza, D.-M. Huang, F. Liu, J. Lee, and P. Abbeel. Body transformer: Leveraging robot embodiment for policy learning. In Conference on Robot Learning, pages 3407–3424. PMLR, 2025

  36. [36]

    M. Liu, D. Pathak, and A. Agarwal. Locoformer: Generalist locomotion via long-context adaptation. In Conference on Robot Learning, pages 532–546. PMLR, 2025

  37. [37]

    D. Li, B. Ai, N. Bohlinger, J. Peters, H. I. Christensen, and H. Su. Online embodiment adaptation for quadrupedal locomotion. 2026

  38. [38]

    L. Smith, I. Kostrikov, and S. Levine. A walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning. In Robotics: Science and systems, 2023

  39. [39]

    L. Smith, Y. Cao, and S. Levine. Grow your limits: Continuous improvement with real-world rl for robotic locomotion. In International conference on robotics and automation, pages 10829–10836. IEEE, 2024

  40. [40]

    J. Levy, T. Westenbroek, and D. Fridovich-Keil. Learning to walk from three minutes of real-world data with semi-structured dynamics models. In Conference on robot learning, 2024

  41. [41]

    N. Bohlinger, J. Kinzel, D. Palenicek, L. Antczak, and J. Peters. Gait in eight: Efficient on-robot learning for omnidirectional quadruped locomotion. International Conference on Intelligent Robots and Systems, 2025

  42. [42]

    N. Bohlinger and J. Peters. Multi-embodiment locomotion at scale with extreme embodiment randomization. arXiv preprint arXiv:2509.02815, 2025

  43. [43]

    M. Balandat, B. Karrer, D. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. Botorch: A framework for efficient monte-carlo bayesian optimization. Advances in neural information processing systems, 33:21524–21538, 2020

  44. [44]

    N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001

  45. [45]

    R. Y. Rubinstein and D. P. Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning, volume 133. Springer, 2004

  46. [46]

    R. Storn and K. Price. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of global optimization, 11(4):341–359, 1997

  47. [47]

    H. Mania, A. Guy, and B. Recht. Simple random search of static linear policies is competitive for reinforcement learning. Advances in neural information processing systems, 31, 2018

  48. [48]

    D. Eriksson, M. Pearce, J. Gardner, R. D. Turner, and M. Poloczek. Scalable global optimization via local bayesian optimization. Advances in neural information processing systems, 32, 2019

  49. [49]

    J. Harb, T. Schaul, D. Precup, and P.-L. Bacon. Policy evaluation networks. arXiv preprint arXiv:2002.11833, 2020

  50. [50]

    F. Faccio, L. Kirsch, and J. Schmidhuber. Parameter-based value functions. In International Conference on Learning Representations, 2021

  51. [51]

    F. Faccio, A. Ramesh, V. Herrmann, J. Harb, and J. Schmidhuber. General policy evaluation and improvement by learning to identify few but crucial states. In Decision Awareness in Reinforcement Learning Workshop at ICML 2022, 2022

  52. [52]

    N. Bohlinger and J. Peters. Massively scaling explicit policy-conditioned value functions. arXiv preprint arXiv:2502.11949, 2025

  53. [53]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015

  54. [54]

    N. Bohlinger and K. Dorer. Rl-x: A deep reinforcement learning library (not only) for robocup. In Robot World Cup, pages 228–239. Springer, 2023

  55. [55]

    E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–. IEEE, 2012. doi:10.1109/IROS.2012.6386109

  56. [56]

    T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural information processing systems, 29, 2016