MUJICA: Multi-skill Unified Joint Integration of Control Architecture for Wheeled-Legged Robots
Pith reviewed 2026-05-14 19:08 UTC · model grok-4.3
The pith
A single policy integrates omnidirectional moving, climbing, and fall recovery for wheeled-legged robots using only proprioceptive sensing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MUJICA is a fully proprioceptive control architecture that jointly trains diverse skills (omnidirectional moving, high-platform climbing, and fall recovery) within a single policy, with each skill distinguished by a unique indicator variable. Training incorporates accurate DC-motor constraint modeling, and a learned high-level skill selector dynamically chooses the appropriate skill from proprioception alone.
What carries the argument
The unified policy with skill indicator variables and a proprioceptive high-level skill selector that enables adaptive locomotion mode selection.
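The paper does not publish its architecture details, but the indicator-variable mechanism can be sketched as a single policy network whose input concatenates the proprioceptive observation with a one-hot skill code. All dimensions, weights, and the skill ordering below are illustrative placeholders, not the authors' values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not specify them.
OBS_DIM, N_SKILLS, ACT_DIM, HIDDEN = 48, 3, 12, 64

# A minimal two-layer policy. In MUJICA-style training these weights
# would be learned jointly across all skills; here they are random.
W1 = rng.standard_normal((HIDDEN, OBS_DIM + N_SKILLS)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((ACT_DIM, HIDDEN)) * 0.1
b2 = np.zeros(ACT_DIM)

def policy(obs: np.ndarray, skill_id: int) -> np.ndarray:
    """One policy for all skills; the indicator selects the behaviour."""
    indicator = np.zeros(N_SKILLS)
    indicator[skill_id] = 1.0      # e.g. 0=move, 1=climb, 2=recover
    x = np.concatenate([obs, indicator])
    h = np.tanh(W1 @ x + b1)
    return np.tanh(W2 @ h + b2)    # joint / wheel targets

obs = rng.standard_normal(OBS_DIM)
a_move = policy(obs, 0)
a_climb = policy(obs, 1)
# Same observation, different indicator -> different action.
assert a_move.shape == (ACT_DIM,) and not np.allclose(a_move, a_climb)
```

The point of the construction is that switching skills is just flipping the indicator, so transitions need no policy swap at deployment time.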
If this is right
- The robot achieves seamless transitions across locomotion modes in response to environmental changes.
- Sim-to-real robustness is enhanced through the joint training and motor modeling.
- Autonomous adjustment to unstructured environments becomes possible without external sensing.
- Task success rates improve in complex terrains on hardware like the Unitree Go2-W.
Where Pith is reading between the lines
- This method could simplify control systems by eliminating the need for multiple separate policies in hybrid robots.
- Similar joint training approaches might apply to other robot types requiring multiple behaviors.
- Testing on additional tasks such as manipulation during locomotion could reveal further capabilities.
Load-bearing premise
The assumption that joint training in simulation with indicator variables and accurate DC-motor models is sufficient to enable robust real-world performance and reliable skill selection without any real-world fine-tuning or additional sensors.
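As a concrete reading of this premise, the high-level selector can be pictured as a classifier that maps a short history of proprioceptive observations to a skill index. The history length, observation size, and linear form below are hypothetical, since the paper gives no specifics:

```python
import numpy as np

rng = np.random.default_rng(1)

OBS_DIM, HIST, N_SKILLS = 48, 10, 3   # hypothetical sizes

# Selector: flattened observation history -> skill logits.
# In the paper this would be learned; here the weights are random.
Ws = rng.standard_normal((N_SKILLS, OBS_DIM * HIST)) * 0.05
bs = np.zeros(N_SKILLS)

def select_skill(obs_history: np.ndarray) -> int:
    """Choose a skill index from proprioception alone (no cameras/LiDAR)."""
    logits = Ws @ obs_history.reshape(-1) + bs
    return int(np.argmax(logits))

history = rng.standard_normal((HIST, OBS_DIM))
skill = select_skill(history)
assert skill in range(N_SKILLS)
```

The load-bearing question is whether such a map, trained only in simulation, keeps producing the right index under real sensor noise.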
What would settle it
A real-world experiment in which the robot, relying on proprioception alone, encounters a high platform and either selects and executes the climbing skill successfully or fails to, and, after being toppled, either executes fall recovery successfully or fails to.
Original abstract
Wheeled-legged robots hold promise for traversing complex terrains and offer superior mobility compared to legged robots. However, wheeled-legged robots must effectively balance both wheeled driving and legged control. Furthermore, due to noisy proprioceptive sensing and real-world motor constraints, realizing robust and adaptive locomotion at peak performance of motors remains challenging. We propose the Multi-skill Unified Joint Integration of Control Architecture (MUJICA), a unified, fully proprioceptive control framework for wheeled-legged robots that integrates diverse low-level skills (including omnidirectional moving, high platform climbing, and fall recovery) within a single policy. All skills, distinguished by unique indicator variables, are trained jointly with accurate DC-motor constraint modeling. Additionally, a high-level skill selector is learned to dynamically choose the optimal skill based solely on proprioceptions, enabling adaptive responses to the surrounding environment. Therefore, MUJICA enhances sim-to-real robustness and enables seamless transitions across diverse locomotion modes, facilitating autonomous adjustment to the environment. We validate our framework in both simulation and real-world experiments on the Unitree Go2-W robot, demonstrating significant improvements in adaptability and task success in unstructured environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MUJICA, a unified proprioceptive control framework for wheeled-legged robots that jointly trains a single policy integrating multiple skills (omnidirectional locomotion, high-platform climbing, and fall recovery) distinguished by indicator variables, incorporates accurate DC-motor modeling in simulation, and learns a high-level selector that chooses skills based solely on proprioceptive inputs. The framework is evaluated in simulation and on the physical Unitree Go2-W platform, with claims of improved sim-to-real robustness and seamless skill transitions in unstructured environments.
Significance. If the empirical claims hold under rigorous quantitative scrutiny, the work would offer a practical route to versatile hybrid locomotion without maintaining separate policies or relying on external sensing, addressing a recognized gap in wheeled-legged control. The joint-training approach with motor constraints and proprioceptive selection is a concrete contribution that could generalize to other multi-modal platforms.
major comments (2)
- [§4] §4 (Experimental validation): the abstract and results claim significant improvements in adaptability and task success on the Unitree Go2-W, yet no quantitative baselines, ablation studies on the indicator variables or selector, success-rate statistics, or error metrics versus prior methods are reported; this absence prevents verification that the unified policy and proprioceptive selector outperform simpler alternatives or that gains are not due to post-hoc tuning.
- [§3.2] §3.2 (High-level skill selector): the central sim-to-real claim rests on the selector learning reliable skill choice from noisy proprioception alone after joint training; the manuscript provides no explicit domain-randomization schedule, noise model details, or real-world adaptation procedure, leaving open whether the selector generalizes beyond simulation artifacts.
minor comments (2)
- Notation for the skill indicator variables is introduced without a compact table summarizing their values and corresponding behaviors; adding such a table would improve readability.
- Figure captions for the real-robot experiments should explicitly state the number of trials and environmental variations tested rather than relying on qualitative descriptions.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the empirical validation and methodological details.
Point-by-point responses
Referee: [§4] §4 (Experimental validation): the abstract and results claim significant improvements in adaptability and task success on the Unitree Go2-W, yet no quantitative baselines, ablation studies on the indicator variables or selector, success-rate statistics, or error metrics versus prior methods are reported; this absence prevents verification that the unified policy and proprioceptive selector outperform simpler alternatives or that gains are not due to post-hoc tuning.
Authors: We agree that the current manuscript would benefit from more rigorous quantitative analysis to support the claims of improved adaptability and task success. In the revised version, we will add: success-rate statistics and error metrics from both simulation and real-world experiments on the Unitree Go2-W; ablation studies isolating the effects of the indicator variables and the high-level selector; and comparisons against baselines, including separate skill-specific policies and variants without proprioceptive selection. These additions will provide direct evidence that the unified joint-training approach outperforms simpler alternatives.
Revision: yes
Referee: [§3.2] §3.2 (High-level skill selector): the central sim-to-real claim rests on the selector learning reliable skill choice from noisy proprioception alone after joint training; the manuscript provides no explicit domain-randomization schedule, noise model details, or real-world adaptation procedure, leaving open whether the selector generalizes beyond simulation artifacts.
Authors: We acknowledge that additional details on the training and transfer process are necessary. The revised manuscript will expand §3.2 to include the full domain-randomization schedule (terrain, friction, and motor parameter ranges), the specific noise models applied to proprioceptive observations during training, and the real-world deployment procedure, which relies on the robustness induced by joint training with accurate DC-motor constraints rather than online adaptation. These clarifications will better substantiate the sim-to-real generalization of the proprioceptive selector.
Revision: yes
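A minimal sketch of what such a domain-randomization schedule and observation-noise model could look like. The parameter names, ranges, and noise level below are invented for illustration; the actual schedule is exactly what the rebuttal promises to disclose:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-episode randomization ranges (not from the paper).
RANGES = {
    "friction":   (0.4, 1.2),    # ground friction coefficient
    "motor_kp":   (0.8, 1.2),    # scale on nominal actuator stiffness
    "payload_kg": (-1.0, 2.0),   # added/removed body mass
}
OBS_NOISE_STD = 0.02             # additive Gaussian on proprioception

def sample_episode_params() -> dict:
    """Draw one set of physics parameters at the start of each episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}

def noisy_obs(obs: np.ndarray) -> np.ndarray:
    """Corrupt clean simulator observations the way real sensors would."""
    return obs + rng.normal(0.0, OBS_NOISE_STD, size=obs.shape)

params = sample_episode_params()
assert all(RANGES[k][0] <= v <= RANGES[k][1] for k, v in params.items())
clean = np.zeros(48)
assert noisy_obs(clean).shape == clean.shape
```

Exposing both the policy and the selector to these perturbations during joint training is the mechanism by which the authors claim sim-to-real robustness without online adaptation.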
Circularity Check
No significant circularity; empirical training and validation chain is self-contained
Full rationale
The paper proposes an empirical control architecture (MUJICA) that jointly trains a single policy in simulation using indicator variables to distinguish skills and accurate DC-motor modeling, then learns a proprioceptive high-level selector, with final claims resting on sim-to-real experiments on the Unitree Go2-W. No derivation chain, equations, or predictions reduce to inputs by construction; there are no self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations that close the loop. The framework is validated against external physical benchmarks rather than internal fits, making the result independent of any circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL reward weights and skill indicator scaling
axioms (2)
- Domain assumption: Proprioceptive signals alone are sufficient to distinguish and select among locomotion skills in unstructured environments.
- Domain assumption: Accurate DC-motor constraint modeling in simulation closes the sim-to-real gap for peak-performance locomotion.
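The second assumption can be made concrete with the standard linear torque-speed envelope of a brushed DC motor, under which the torque available to the controller shrinks as joint speed rises. The stall torque and no-load speed below are placeholder values, not Go2-W actuator specifications:

```python
import numpy as np

# Placeholder actuator constants for illustration only.
TAU_STALL = 45.0   # N·m, torque available at zero speed
OMEGA_MAX = 30.0   # rad/s, no-load speed (zero torque available)

def clamp_torque(tau_cmd: float, omega: float) -> float:
    """Enforce the linear DC-motor envelope:
    |tau| <= tau_stall * (1 - |omega| / omega_noload)."""
    tau_avail = TAU_STALL * max(0.0, 1.0 - abs(omega) / OMEGA_MAX)
    return float(np.clip(tau_cmd, -tau_avail, tau_avail))

assert clamp_torque(40.0, 0.0) == 40.0    # full command passes at stall
assert clamp_torque(40.0, 15.0) == 22.5   # half speed halves the envelope
assert clamp_torque(40.0, 30.0) == 0.0    # no torque at no-load speed
```

Passing every commanded torque through such a clamp in simulation prevents the policy from learning behaviours that depend on torque the real actuator cannot deliver at speed, which is the stated role of the DC-motor modeling in MUJICA.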
Reference graph
Works this paper leans on
- [1] C. Tang, B. Abbatematteo, J. Hu, R. Chandra, R. Martín-Martín, and P. Stone, "Deep reinforcement learning for robotics: A survey of real-world successes," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 27, 2025, pp. 28694–28698.
- [2] R. Huang, S. Zhu, Y. Du, and H. Zhao, "Moe-loco: Mixture of experts for multitask locomotion," arXiv preprint arXiv:2503.08564, 2025.
- [3] W. Zhang and K. Wang, "Deep reinforcement learning in mixture of experts control system for blind wheeled-legged quadrupedal locomotion," in 2024 International Conference on Advanced Robotics and Intelligent Systems (ARIS). IEEE, 2024, pp. 1–5.
- [4] I. Nahrendra, B. Yu, and H. Myung, "Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning," arXiv preprint arXiv:2301.10602, 2023.
- [5] Y. Lu, Y. Dong, J. Zhang, J. Ma, and P. Lu, "Fr-net: Learning robust quadrupedal fall recovery on challenging terrains through mass-contact prediction," IEEE Robotics and Automation Letters, 2025.
- [6] Y. Zhang, Q. Qian, T. Hou, P. Zhai, X. Wei, K. Hu, J. Yi, and L. Zhang, "Renet: Fault-tolerant motion control for quadruped robots via redundant estimator networks under visual collapse," IEEE Robotics and Automation Letters, pp. 1–8, 2025.
- [7] J. Lee, M. Bjelonic, A. Reske, L. Wellhausen, T. Miki, and M. Hutter, "Learning robust autonomous navigation and locomotion for wheeled-legged robots," Science Robotics, vol. 9, no. 89, p. eadi9641, 2024.
- [8] S. Chamorro, V. Klemm, M. d. L. I. Valls, C. Pal, and R. Siegwart, "Reinforcement learning for blind stair climbing with legged and wheeled-legged robots," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 8081–8087.
- [9] N. Shah, K. Tiwari, and A. Bera, "Mtac: Hierarchical reinforcement learning-based multi-gait terrain-adaptive quadruped controller," arXiv preprint arXiv:2401.03337, 2023.
- [10] G. Bellegarda, M. Shafiee, and A. Ijspeert, "Allgaits: Learning all quadruped gaits and transitions," in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 15929–15935.
- [11] C. Li, S. Blaes, P. Kolev, M. Vlastelica, J. Frey, and G. Martius, "Versatile skill control via self-supervised adversarial imitation of unlabeled mixed motions," in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 2944–2950.
- [12] L. Zhang, L. Shen, L. Yang, S. Chen, B. Yuan, X. Wang, and D. Tao, "Penalized proximal policy optimization for safe reinforcement learning," arXiv preprint arXiv:2205.11814, 2022.
- [13] Y. Kim, H. Oh, J. Lee, J. Choi, G. Ji, M. Jung, D. Youm, and J. Hwangbo, "Not only rewards but also constraints: Applications on legged robot locomotion," IEEE Transactions on Robotics, vol. 40, pp. 2984–3003, 2024.
- [14] S. Gangapurwala, A. Mitchell, and I. Havoutis, "Guided constrained policy optimization for dynamic quadrupedal robot locomotion," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3642–3649, 2020.
- [15] Y. Yang, G. Shi, C. Lin, X. Meng, R. Scalise, M. G. Castro, W. Yu, T. Zhang, D. Zhao, J. Tan et al., "Agile continuous jumping in discontinuous terrains," arXiv preprint arXiv:2409.10923, 2024.
- [16] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, "Sim-to-real: Learning agile locomotion for quadruped robots," arXiv preprint arXiv:1804.10332, 2018.
- [17] A. Kumar, Z. Fu, D. Pathak, and J. Malik, "Rma: Rapid motor adaptation for legged robots," arXiv preprint arXiv:2107.04034, 2021.
- [18] J. Long, Z. Wang, Q. Li, J. Gao, L. Cao, and J. Pang, "Hybrid internal model: Learning agile legged locomotion with simulated robot response," arXiv preprint arXiv:2312.11460, 2023.
- [19] E. Chane-Sane, P.-A. Leziart, T. Flayols, O. Stasse, P. Souères, and N. Mansard, "Cat: Constraints as terminations for legged locomotion reinforcement learning," in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 13303–13310.
- [20] I. Dadiotis, M. Mittal, N. Tsagarakis, and M. Hutter, "Dynamic object goal pushing with mobile manipulators through model-free constrained reinforcement learning," arXiv preprint arXiv:2502.01546, 2025.
- [21] Q. Zhou, H. Ding, T. Chen, L. Man, H. Jiang, G. Zhang, B. Li, X. Rong, and Y. Li, "Alarm: Safe reinforcement learning with reliable mimicry for robust legged locomotion," IEEE Robotics and Automation Letters, 2025.
- [22] E. Parisotto, J. L. Ba, and R. Salakhutdinov, "Actor-mimic: Deep multitask and transfer reinforcement learning," arXiv preprint arXiv:1511.06342, 2015.
- [23] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, "Distral: Robust multitask reinforcement learning," Advances in Neural Information Processing Systems, vol. 30, 2017.
- [24] C. Yang, K. Yuan, Q. Zhu, W. Yu, and Z. Li, "Multi-expert learning of adaptive legged locomotion," Science Robotics, vol. 5, no. 49, p. eabb2174, 2020.
- [25] W. Yu, F. Acero, V. Atanassov, C. Yang, I. Havoutis, D. Kanoulas, and Z. Li, "Discovery of skill switching criteria for learning agile quadruped locomotion," arXiv preprint arXiv:2502.06676, 2025.
- [26] C. Li, S. Blaes, P. Kolev, M. Vlastelica, J. Frey, and G. Martius, "Versatile skill control via self-supervised adversarial imitation of unlabeled mixed motions," arXiv preprint arXiv:2209.07899, 2022.
- [27] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, "Unsupervised learning of visual features by contrasting cluster assignments," Advances in Neural Information Processing Systems, vol. 33, pp. 9912–9924, 2020.
- [28] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, "Asymmetric actor critic for image-based robot learning," arXiv preprint arXiv:1710.06542, 2017.
- [29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
- [30] Y.-H. Shin, T.-G. Song, G. Ji, and H.-W. Park, "Actuator-constrained reinforcement learning for high-speed quadrupedal locomotion," arXiv preprint arXiv:2312.17507, 2023.
- [31] N. Rudin, D. Hoeller, P. Reist, and M. Hutter, "Learning to walk in minutes using massively parallel deep reinforcement learning," in Conference on Robot Learning. PMLR, 2022, pp. 91–100.