
arxiv: 2606.20705 · v1 · pith:3SOCPJH5new · submitted 2026-06-15 · 💻 cs.CV · cs.AI · cs.RO

MotionPyramid: Hierarchical Motion Representation and Residual Interfaces

Pith reviewed 2026-06-27 04:11 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords hierarchical motion representation · residual interfaces · humanoid control · reinforcement learning · motion tracking · latent decoders · multi-level action interfaces · motion hierarchy

The pith

Motion can be organized as a reusable hierarchy of latent decoders that serve as multi-resolution action interfaces for RL policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether motion admits the same kind of layered structure that perception does, from immediate motor commands up to extended behaviors such as gait cycles or balance recovery. It constructs this structure by training a recursive stack of latent decoders on motion-tracking data so that higher-level latents unfold into sequences of lower-level commands. Once the stack is frozen, reinforcement-learning policies can select actions at any chosen level of the hierarchy. Coarser levels reduce the space of plausible motions and thereby speed early learning, while finer levels and residual corrections keep the controller responsive to task feedback. The result is structured abstraction that still permits precise, editable control across time scales.

Core claim

MotionPyramid trains a recursive stack of latent decoders from a motion-tracking teacher. Low-level latents decode directly to full-body motor commands, while each higher level decodes into a sequence of commands at the level below, thereby producing temporally extended motion programs. After pretraining, the entire hierarchy is frozen and exposed to downstream RL policies as a family of action interfaces at different temporal resolutions. Representation probes confirm that the learned levels support traversal, interpolation, transition, and composition. Residual Interfaces further allow a single policy to issue coarse segment-level commands and frame-level corrections simultaneously through

What carries the argument

a recursive stack of latent decoders in which higher-level latents unfold through lower levels into temporally extended motion programs
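
As a rough illustration of this unfolding, the sketch below expands one top-level latent into a flat sequence of motor commands. It is a minimal sketch under stated assumptions, not the authors' implementation: the decoders are fixed random linear maps standing in for frozen learned networks, and the names `make_linear_decoder`, `unfold`, `BRANCHING`, and all dimensions are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]
Decoder = Callable[[Vector], List[Vector]]

def make_linear_decoder(in_dim: int, out_dim: int, steps: int, seed: int) -> Decoder:
    """Hypothetical stand-in for a frozen decoder: maps one latent to
    `steps` output vectors of size `out_dim` via fixed random linear maps."""
    rng = random.Random(seed)
    weights = [[[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                for _ in range(out_dim)]
               for _ in range(steps)]

    def decode(z: Vector) -> List[Vector]:
        return [[sum(w * x for w, x in zip(row, z)) for row in step_w]
                for step_w in weights]

    return decode

def unfold(z: Vector, level: int, decoders: List[Decoder]) -> List[Vector]:
    """Recursively expand a level-`level` latent into motor commands.
    decoders[0] maps a level-1 latent to motor commands; decoders[k] maps a
    level-(k+1) latent to a sequence of level-k latents."""
    children = decoders[level - 1](z)
    if level == 1:
        return children  # already motor commands
    commands: List[Vector] = []
    for child in children:
        commands.extend(unfold(child, level - 1, decoders))
    return commands

# Assumed sizes, chosen only for illustration.
LATENT_DIM, MOTOR_DIM = 4, 6
BRANCHING = [1, 2, 3]  # outputs per decision at levels 1, 2, 3
decoders = [
    make_linear_decoder(LATENT_DIM, MOTOR_DIM if k == 0 else LATENT_DIM, steps, seed=k)
    for k, steps in enumerate(BRANCHING)
]

z_top = [0.5, -0.2, 0.1, 0.9]
program = unfold(z_top, level=3, decoders=decoders)
print(len(program), len(program[0]))  # 6 commands, each of size 6
```

With these assumed branching factors, one level-3 decision covers 3 × 2 × 1 = 6 motor commands, which is the sense in which a higher-level latent acts as a temporally extended motion program.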

If this is right

  • Coarser interfaces constrain exploration to structured motion segments and thereby improve early learning and motion regularity.
  • Finer interfaces preserve closed-loop feedback and final task precision.
  • The hierarchy supports explicit traversal, interpolation, and qualitative composition of motions.
  • Residual Interfaces let coarse motion programs and fine corrections coexist inside one controller.
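
The last point can be pictured with a small sketch. The summary does not specify how residual commands enter the frozen hierarchy, so the code below assumes the simplest option: an additive residual applied to every output of each level's decoder. The decoders are toy functions, all vectors share one dimension, and `toy_decoder` and `unfold_with_residuals` are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]
Decoder = Callable[[Vector], List[Vector]]

def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]

def toy_decoder(steps: int, scale: float) -> Decoder:
    """Hypothetical frozen decoder: emits `steps` scaled copies of its input."""
    return lambda z: [[scale * (i + 1) * x for x in z] for i in range(steps)]

def unfold_with_residuals(z: Vector, level: int, decoders: List[Decoder],
                          residuals: List[Vector]) -> List[Vector]:
    """Expand a latent through frozen decoders (levels indexed from 0 at the
    bottom), adding the policy's residual for that level to each decoder
    output. Additive residuals are an assumption of this sketch."""
    outputs = [add(out, residuals[level]) for out in decoders[level](z)]
    if level == 0:
        return outputs  # motor commands after the frame-level correction
    commands: List[Vector] = []
    for child in outputs:
        commands.extend(unfold_with_residuals(child, level - 1, decoders, residuals))
    return commands

decoders = [toy_decoder(steps=1, scale=1.0), toy_decoder(steps=2, scale=1.0)]
z_coarse = [1.0, 0.0]

# Coarse command only: all residuals are zero.
coarse_only = unfold_with_residuals(z_coarse, 1, decoders, [[0.0, 0.0], [0.0, 0.0]])
# Same coarse command plus a frame-level correction on the second coordinate.
corrected = unfold_with_residuals(z_coarse, 1, decoders, [[0.0, 0.5], [0.0, 0.0]])

print(coarse_only)  # [[1.0, 0.0], [2.0, 0.0]]
print(corrected)    # [[1.0, 0.5], [2.0, 0.5]]
```

In this toy setting the coarse program fixes the overall shape of the command sequence, while the bottom-level residual shifts each command without changing that shape.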

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the learned levels prove reusable across tasks, the same pretraining procedure could supply motion priors for other sequential control problems.
  • The residual-interface pattern could be tested by measuring how much performance drops when the skip connections between levels are removed.
  • The same recursive-decoder construction might be applied to other temporally extended signals such as speech or video synthesis.

Load-bearing premise

Training a recursive stack of latent decoders on motion-tracking data will automatically yield levels that remain meaningful and reusable when frozen and inserted as action interfaces into downstream RL policies.

What would settle it

An RL policy using the frozen MotionPyramid levels as action interfaces shows no gain in sample efficiency or final task performance compared with an otherwise identical policy that acts directly on raw motor commands.
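
One way to make this test concrete is to compare, per seed, how many environment steps each policy needs to reach a fixed return threshold, alongside its final return. The sketch below is illustrative only: the curves are placeholder numbers, not results from the paper, and `steps_to_threshold` and the 0.8 threshold are hypothetical choices.

```python
from typing import List, Optional, Sequence

def steps_to_threshold(steps: Sequence[int], returns: Sequence[float],
                       threshold: float) -> Optional[int]:
    """First logged step at which the return reaches the threshold, else None."""
    for step, value in zip(steps, returns):
        if value >= threshold:
            return step
    return None

def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)

# Placeholder learning curves (one list per seed); NOT data from the paper.
steps = [0, 1000, 2000, 3000]
interface_curves = [[0.0, 0.6, 0.8, 0.9], [0.1, 0.5, 0.85, 0.9]]
raw_action_curves = [[0.0, 0.2, 0.5, 0.9], [0.0, 0.3, 0.6, 0.85]]

THRESHOLD = 0.8
interface_steps: List[Optional[int]] = [
    steps_to_threshold(steps, curve, THRESHOLD) for curve in interface_curves
]
raw_steps: List[Optional[int]] = [
    steps_to_threshold(steps, curve, THRESHOLD) for curve in raw_action_curves
]

print("interface steps-to-threshold:", interface_steps)
print("raw-action steps-to-threshold:", raw_steps)
print("interface mean final return:", mean([c[-1] for c in interface_curves]))
print("raw-action mean final return:", mean([c[-1] for c in raw_action_curves]))
```

The claim would be undercut if, across enough seeds, the interface policy showed neither fewer steps to threshold nor a higher final return than the raw-action policy.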

Figures

Figures reproduced from arXiv: 2606.20705 by Gao Zhu, Yubei Chen, Zaishuo Xia.

Figure 1: Overview of MotionPyramid. Left: representation probes visualize sampling, interpolation, traversal, and composition through the frozen hierarchy. Right: a hierarchy of reusable action interfaces recursively unfolds coarse latent decisions into lower level latents and motor commands. Middle: fixed pyramid levels reveal a tradeoff between learning speed and final precision across downstream tasks, while Res…
Figure 2: MotionPyramid improves both action representation learning and downstream control. Top: during recursive distillation, higher levels learn faster at early stages while lower levels preserve a stronger final control ceiling. Bottom: for downstream reinforcement learning, we compare fixed MotionPyramid interfaces, Mixture of Interfaces, and Residual Interfaces on speed, reach, and strike. Mixture of Interfac…
Figure 3: Comparison against 30 Hz latent baselines. We compare Residual Interfaces with scratch training and 30 Hz latent action baselines on speed, reach, and strike.
Figure 4: Representative latent traversals at z^(3). For each row, we fix the proprioceptive state and all latent coordinates except one coordinate, then vary the selected coordinate from low (−2σ) to the prior mean and high (+2σ) in normalized prior units. Since one z^(3) decision unfolds over H_3 simulator steps, each cell visualizes a short rollout snippet rather than a single rendered frame. The three rows show …
Figure 5: Skill transition probe using the frozen MotionPyramid hierarchy. The rollout transitions between running, martial arts motion, running, jumping, and running again. The sequence illustrates that the learned hierarchy can move between distinct motion modes while preserving physically stable whole body behavior.
Figure 6: Selection of Layers in Mixture of Interfaces. We plot the fraction of selected horizons over downstream training for speed, reach, strike, and their mean.
Original abstract

We ask whether the representational hierarchy seen in perception, from local primitives such as edges to higher level structures such as parts and objects, can be established for motion. In humanoid control, low level actions specify immediate motor commands, while meaningful behavior is organized over longer temporal scales, including contacts, gait fragments, balance recovery, reaching, and whole body skills. We introduce MotionPyramid, a hierarchical action representation that learns such structure from motion data. Starting from a motion tracking teacher, it trains a recursive stack of latent decoders: low level latents decode to immediate full body motor commands, while higher level latents unfold through lower levels into temporally extended motion programs. After pretraining, the hierarchy is frozen and reused by downstream reinforcement learning policies as a family of action interfaces at different control resolutions. Experiments show the learned levels form a motion hierarchy: coarser interfaces improve early learning and motion regularity by constraining exploration to structured segments, while finer interfaces preserve feedback control and final task precision. Representation probes show the hierarchy supports traversal, interpolation, transition, and qualitative composition, exposing editable control handles across temporal scales. Finally, we introduce Residual Interfaces, letting a downstream policy maintain coarse, segment level, and frame level residual commands through the frozen hierarchy. Analogous to residual or skip connections in deep networks, this allows coarse motion programs and fine residual corrections to coexist within one controller. MotionPyramid shows that motion, like perception, can be organized into a reusable multi level representation, providing structured abstraction without sacrificing controllability.
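
The traversal and interpolation probes mentioned in the abstract can be sketched in a few lines. Traversal here follows the description in the Figure 4 caption (vary one coordinate from −2σ through the prior mean to +2σ while holding the rest fixed); linear interpolation between two latents is an assumption of this sketch, since the page does not say which interpolation scheme is used. The function names and example values are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]

def traverse(z: Sequence[float], index: int, prior_mean: Sequence[float],
             prior_std: Sequence[float],
             offsets: Sequence[float] = (-2.0, 0.0, 2.0)) -> List[Vector]:
    """Return copies of `z` in which coordinate `index` is set to
    prior_mean + k * prior_std for each k in `offsets`; other coordinates stay fixed."""
    variants: List[Vector] = []
    for k in offsets:
        variant = list(z)
        variant[index] = prior_mean[index] + k * prior_std[index]
        variants.append(variant)
    return variants

def interpolate(z_a: Sequence[float], z_b: Sequence[float], num: int) -> List[Vector]:
    """Linearly interpolate between two latents with `num` >= 2 evenly spaced points."""
    points: List[Vector] = []
    for i in range(num):
        t = i / (num - 1)
        points.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return points

z = [0.3, -0.1, 0.7]
sweep = traverse(z, index=1, prior_mean=[0.0, 0.0, 0.0], prior_std=[1.0, 0.5, 1.0])
print(sweep)  # [[0.3, -1.0, 0.7], [0.3, 0.0, 0.7], [0.3, 1.0, 0.7]]

path = interpolate([0.0, 0.0], [1.0, 2.0], num=3)
print(path)   # [[0.0, 0.0], [0.5, 1.0], [1.0, 2.0]]
```

Each latent produced this way would then be decoded through the frozen hierarchy and rolled out, so that a probe compares short motion snippets rather than single frames.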

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces MotionPyramid, a hierarchical motion representation for humanoid control learned from a motion-tracking teacher via a recursive stack of latent decoders. Low-level latents decode to immediate motor commands, while higher levels unfold into temporally extended programs. After pretraining, the hierarchy is frozen and inserted into downstream RL policies as a set of multi-resolution action interfaces. Residual Interfaces allow a policy to issue coarse-to-fine residual commands through the frozen stack. The abstract states that experiments and representation probes show that coarser levels constrain exploration and improve regularity while finer levels preserve precision, and that the levels support traversal, interpolation, transition, and composition.

Significance. If the reported experiments hold with appropriate controls and baselines, the work would demonstrate a reusable, multi-scale motion abstraction that improves sample efficiency and controllability in RL without sacrificing final-task performance. The residual-interface mechanism is a concrete engineering contribution that could be adopted in other hierarchical control settings.

major comments (2)
  1. [Abstract] The central claim that 'experiments show the learned levels form a motion hierarchy' and that 'coarser interfaces improve early learning' rests entirely on described results, yet the abstract supplies no quantitative metrics, baselines, task definitions, or statistical controls. Without these the support for the hierarchy-benefit claim cannot be evaluated and the claim is not yet load-bearing.
  2. [Abstract] The description of the recursive latent-decoder stack and the claim that higher-level latents 'unfold through lower levels into temporally extended motion programs' is presented without any equations, loss functions, or training details. This makes it impossible to assess whether the hierarchy is learned or imposed by construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address each major point below and indicate where revisions to the manuscript are appropriate.

  1. Referee: [Abstract] The central claim that 'experiments show the learned levels form a motion hierarchy' and that 'coarser interfaces improve early learning' rests entirely on described results, yet the abstract supplies no quantitative metrics, baselines, task definitions, or statistical controls. Without these the support for the hierarchy-benefit claim cannot be evaluated and the claim is not yet load-bearing.

    Authors: We agree that the abstract is concise and omits specific numbers. The full manuscript reports quantitative results, including task definitions, baselines, and multi-seed statistics in the Experiments section. To make the abstract's claims more self-contained, we will revise it to include brief quantitative highlights of the reported benefits.
    Revision: yes

  2. Referee: [Abstract] The description of the recursive latent-decoder stack and the claim that higher-level latents 'unfold through lower levels into temporally extended motion programs' is presented without any equations, loss functions, or training details. This makes it impossible to assess whether the hierarchy is learned or imposed by construction.

    Authors: The abstract summarizes at a high level. The manuscript details the recursive training from the motion-tracking teacher, the unfolding mechanism, and the per-level losses in Section 3, confirming the hierarchy is learned. We will revise the abstract to explicitly state that the levels are learned via this procedure.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity


The paper presents an empirical method for learning hierarchical motion representations via a recursive stack of latent decoders pretrained on motion tracking data and then frozen for use in downstream RL. No mathematical derivations, first-principles predictions, or equations are described that could reduce to fitted inputs by construction. Claims rest on experimental outcomes (hierarchy properties, residual interfaces) rather than any self-definitional or fitted-input structure. No self-citations or uniqueness theorems are invoked in the provided text. The derivation chain is therefore self-contained as a standard training-and-evaluation pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the hierarchy itself is learned rather than postulated with independent evidence.


