arxiv: 2505.19054 · v2 · submitted 2025-05-25 · 💻 cs.LG

RANDPOL: Parameter-Efficient End-to-End Quadruped Locomotion via Randomized Policy Learning

Pith reviewed 2026-05-19 13:42 UTC · model grok-4.3

classification 💻 cs.LG
keywords quadruped locomotion · parameter-efficient control · randomized policy learning · sim-to-real transfer · reinforcement learning · Unitree Go2

The pith

RANDPOL achieves comparable quadruped locomotion by training only the final linear layer while fixing random hidden features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether end-to-end quadruped locomotion control requires fully trainable deep networks with many parameters. It introduces RANDPOL, which randomly initializes and freezes the hidden layers of the actor and critic, training solely the output readout. This design draws on randomized function approximation to preserve nonlinear capability while shrinking the trainable space. Experiments on the Unitree Go2 show performance close to standard PPO alongside reduced parameters, faster per-iteration training, and direct sim-to-real transfer under velocity and yaw commands. The results indicate that structured control tasks can succeed without optimizing the entire network.
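
One compact way to write the controller class described above is given below. The notation is an assumption made here for illustration and is not taken from the paper: $s$ is the observation, $\sigma$ a fixed nonlinearity, $(A_\ell, b_\ell)$ the hidden-layer weights that are sampled once and never updated, and $\theta$, $w$ the trained readouts of the actor and critic. Whether actor and critic share one feature map or use separate ones is not stated in the source, so separate maps $\phi_\pi$ and $\phi_V$ are assumed.

```latex
\phi(s) = \sigma\bigl(A_L\,\sigma(\cdots\sigma(A_1 s + b_1)\cdots) + b_L\bigr),
\qquad (A_\ell, b_\ell)\ \text{sampled once and frozen},
\\[4pt]
\mu_\theta(s) = \theta^{\top}\phi_\pi(s),
\qquad
V_w(s) = w^{\top}\phi_V(s),
\qquad \text{only } \theta \text{ and } w \text{ are optimized.}
```

Under this reading the optimization variables enter the actor mean and the value estimate linearly, which is what shrinks the dimension of the optimization problem.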

Core claim

RANDPOL attains comparative locomotion performance with far fewer trainable parameters, lower learning-phase computation time per iteration, and a favorable performance-complexity trade-off. We further demonstrate successful zero-shot sim-to-real transfer of the learned RANDPOL controller on the physical Unitree Go2 under user-issued forward-velocity and yaw-rate commands.

What carries the argument

Fixed random nonlinear features in the hidden layers of the actor and critic networks, which supply an expressive random basis so that only the final linear readout needs training.
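
As a rough illustration of this mechanism (not the paper's implementation), the sketch below draws a random tanh hidden layer once, keeps it fixed, and trains only a linear readout by gradient descent on a toy 1-D regression target. The target function, the width of 32, the random biases, the step size, and the helper names `make_random_layer`, `features`, `mse`, and `fit_readout` are all assumptions chosen for illustration. The paper trains the readout inside a reinforcement-learning loop, not by supervised regression.

```python
import math
import random

def make_random_layer(in_dim, width, rng):
    """Sample a hidden layer once; it is never updated afterwards."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(width)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(width)]  # assumed bias distribution
    return W, b

def features(x, W, b):
    """Fixed nonlinear features phi(x) = tanh(W x + b)."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def mse(Phi, y, theta):
    """Mean squared error of the linear readout theta on features Phi."""
    preds = [sum(t * p for t, p in zip(theta, phi)) for phi in Phi]
    return sum((pred - yi) ** 2 for pred, yi in zip(preds, y)) / len(y)

def fit_readout(Phi, y, steps=300):
    """Gradient descent on the readout weights only."""
    n, width = len(y), len(Phi[0])
    theta = [0.0] * width
    lr = 1.0 / width  # safe step size because every feature lies in [-1, 1]
    for _ in range(steps):
        errs = [sum(t * p for t, p in zip(theta, phi)) - yi
                for phi, yi in zip(Phi, y)]
        grad = [sum(e * phi[k] for e, phi in zip(errs, Phi)) / n
                for k in range(width)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

rng = random.Random(0)
W, b = make_random_layer(in_dim=1, width=32, rng=rng)
xs = [[-1.0 + 2.0 * i / 49] for i in range(50)]
ys = [math.sin(3.0 * x[0]) for x in xs]
Phi = [features(x, W, b) for x in xs]  # computed once, since W and b are frozen

initial_mse = mse(Phi, ys, [0.0] * 32)
theta = fit_readout(Phi, ys)
final_mse = mse(Phi, ys, theta)
print(f"MSE before training: {initial_mse:.4f}, after: {final_mse:.4f}")
```

Because the hidden layer never changes, the feature matrix can be computed once and the fit reduces to a linear least-squares problem in `theta`.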

If this is right

  • Training time and memory per iteration drop because gradients are computed only for the readout weights.
  • The same fixed-random architecture supports zero-shot transfer from simulation to the physical Unitree Go2 robot.
  • Reducing trainable complexity remains compatible with effective simulated and real-world quadruped performance.
  • The approach offers a favorable performance-complexity trade-off relative to fully trainable PPO baselines.
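
The size of the saving in the first bullet can be made concrete with simple arithmetic. The sketch below counts parameters for a hypothetical actor network. The observation dimension (48), action dimension (12), and the choice of two hidden layers are placeholders assumed here, not figures from the paper; the width of 256 matches the value quoted in the simulated rebuttal further down. The helper `mlp_param_counts` is likewise hypothetical, and it assumes the readout bias is trained along with the readout weights.

```python
def mlp_param_counts(layer_sizes):
    """Return (all parameters, readout-only parameters) for an MLP with biases."""
    per_layer = [(n_in + 1) * n_out  # +1 accounts for the bias of each unit
                 for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
    return sum(per_layer), per_layer[-1]

# Assumed sizes: 48 observations -> 256 -> 256 -> 12 actions.
sizes = [48, 256, 256, 12]
total, readout = mlp_param_counts(sizes)
print(f"fully trainable: {total}")                     # 81420
print(f"readout only:    {readout}")                   # 3084
print(f"trainable share: {100 * readout / total:.1f}%")  # 3.8%
```

With these assumed sizes, training only the readout leaves under 4% of the parameters subject to gradient updates. The real ratio depends on the network dimensions the paper actually uses.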

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fixed-random readouts could be tested on other structured tasks such as manipulation or navigation where full network training is costly.
  • Hardware with limited on-device training might run RANDPOL-style controllers after offline readout optimization.
  • The method invites direct comparison with reservoir-computing or extreme-learning-machine variants on the same locomotion benchmarks.

Load-bearing premise

Fixed random nonlinear features supply enough expressiveness for structured quadruped locomotion without any training of the hidden layers.

What would settle it

A direct comparison showing that RANDPOL cannot match PPO on key locomotion metrics such as forward speed, stability, or energy efficiency even after scaling the number of random features.

Figures

Figures reproduced from arXiv: 2505.19054 by Quan Nguyen, Rahul Jain, Zhuochen Liu.

Figure 1: Representative simulation environments used for …
Figure 2: Training curves on the Go2 forward-and-yaw velocity-tracking task. Each plot shows the mean across five runs …
Figure 3: Representative real-world evaluation conditions for …
Figure 4: Representative hardware yaw-rate tracking trajectories …
Original abstract

Modern learning-based locomotion controllers typically rely on fully trainable deep neural networks with a large number of parameters. This paper studies a different design point for end-to-end control: whether effective quadruped locomotion can be achieved with a drastically reduced trainable parameter space. We present RANDomized POlicy Learning (RANDPOL), a policy learning approach in which the hidden layers of the actor and critic are randomly initialized and fixed, while only the final linear readout is trained. This yields a parameter-efficient controller class that retains nonlinear expressiveness through a fixed random basis while substantially reducing the dimension of the optimization problem. RANDPOL is supported by the mathematical foundation of randomized function approximation, which provides a principled basis for using fixed random nonlinear features as expressive function classes. We evaluate RANDPOL on end-to-end locomotion control for the Unitree Go2 quadruped and compare it with Proximal Policy Optimization (PPO). The results show that RANDPOL attains comparative locomotion performance with far fewer trainable parameters, lower learning-phase computation time per iteration, and a favorable performance-complexity trade-off. We further demonstrate successful zero-shot sim-to-real transfer of the learned RANDPOL controller on the physical Unitree Go2 under user-issued forward-velocity and yaw-rate commands. These results indicate that, for structured robotic control problems, reducing trainable complexity can remain compatible with effective simulated and real-world performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces RANDPOL, a policy learning method for end-to-end quadruped locomotion in which the hidden layers of the actor and critic are randomly initialized and fixed while only the final linear readout is trained. It claims that this yields locomotion performance comparable to standard PPO on the Unitree Go2, with substantially fewer trainable parameters, lower per-iteration computation during learning, and a favorable performance-complexity trade-off, together with successful zero-shot sim-to-real transfer under forward-velocity and yaw-rate commands. The approach is justified by randomized function approximation theory.

Significance. If the empirical claims hold under rigorous verification, the work would be significant for showing that fixed random nonlinear features can supply adequate expressiveness for high-dimensional structured control tasks such as quadruped locomotion. This could reduce training cost and memory footprint for RL-based controllers while preserving sim-to-real capability, and it provides a concrete test of randomized approximation ideas in a robotics setting.

major comments (2)
  1. [Method] Method section: the paper does not specify the hidden-layer width, the distribution used to draw the random weights and biases, or any kernel-alignment or approximation-quality diagnostic. These details are load-bearing for the central claim that fixed random features provide a sufficiently rich basis for both actor and critic without hidden-layer training.
  2. [Experiments] Experiments and Results sections: the claim of 'comparative locomotion performance' with PPO is stated without quantitative metrics (e.g., velocity tracking error, success rate, or return), statistical details across seeds, or explicit description of how the PPO baseline network size and training budget were matched. This leaves the performance-complexity trade-off assertion with limited verifiable support.
minor comments (1)
  1. [Abstract] Abstract: the reductions in trainable parameters and per-iteration compute time are described qualitatively ('far fewer', 'lower') but not quantified, which would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the planned revisions.

  1. Referee: [Method] Method section: the paper does not specify the hidden-layer width, the distribution used to draw the random weights and biases, or any kernel-alignment or approximation-quality diagnostic. These details are load-bearing for the central claim that fixed random features provide a sufficiently rich basis for both actor and critic without hidden-layer training.

    Authors: We agree that these implementation details are necessary to substantiate the claim that fixed random features suffice for the task. In the revised manuscript we will explicitly report the hidden-layer width (256 units), the weight initialization distribution (zero-mean Gaussian with variance equal to the reciprocal of the input dimension, with biases initialized to zero), and a short discussion of the randomized function approximation theory that underpins the expressiveness of the fixed basis for both actor and critic. revision: yes

  2. Referee: [Experiments] Experiments and Results sections: the claim of 'comparative locomotion performance' with PPO is stated without quantitative metrics (e.g., velocity tracking error, success rate, or return), statistical details across seeds, or explicit description of how the PPO baseline network size and training budget were matched. This leaves the performance-complexity trade-off assertion with limited verifiable support.

    Authors: We acknowledge that additional quantitative detail would strengthen the presentation. The revised Experiments and Results sections will include explicit metrics (velocity tracking error, success rate, and return), report means and standard deviations across five independent seeds, and clarify that the PPO baseline uses the same hidden-layer dimensions with all parameters trainable while matching the total number of environment steps. revision: yes
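
The initialization named in the first response can be sketched as follows, taking the weight variance to be the reciprocal of the input dimension and setting all biases to zero. The width of 256 comes from the response; the input dimension of 48 and the helper name `init_frozen_layer` are hypothetical placeholders. The check at the end only confirms that the sampled weights have roughly the intended variance.

```python
import math
import random
import statistics

def init_frozen_layer(in_dim, width, rng):
    """Sample weights from N(0, 1/in_dim) with zero biases; nothing here is trained."""
    std = 1.0 / math.sqrt(in_dim)
    W = [[rng.gauss(0.0, std) for _ in range(in_dim)] for _ in range(width)]
    b = [0.0] * width
    return W, b

rng = random.Random(0)
in_dim, width = 48, 256  # in_dim is an assumed placeholder
W, b = init_frozen_layer(in_dim, width, rng)

flat = [w for row in W for w in row]
emp_var = statistics.pvariance(flat)
print(f"empirical variance: {emp_var:.5f}, target 1/in_dim: {1 / in_dim:.5f}")
```

Scaling the variance by the input dimension keeps each pre-activation at roughly the same scale as an individual input when inputs are normalized, which helps keep a saturating nonlinearity out of its flat regions.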

Circularity Check

0 steps flagged

No circularity: external randomized approximation theory plus empirical validation

The paper's core design—fixing random hidden layers and training only the final linear readout—is justified by reference to the established external theory of randomized function approximation rather than any internal derivation or fit. Performance claims rest on direct empirical comparison to PPO, including sim-to-real transfer metrics on the Unitree Go2, without renaming fitted quantities as predictions or invoking self-citations whose content reduces to the present work. The derivation chain therefore remains self-contained against independent mathematical results and experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the external theory of randomized function approximation to guarantee that fixed random hidden layers remain expressive for locomotion; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Randomized function approximation supplies a principled basis for using fixed random nonlinear features as expressive function classes for control policies.
    Invoked in the abstract to underwrite the decision to freeze hidden layers.

pith-pipeline@v0.9.0 · 5775 in / 1198 out tokens · 65353 ms · 2026-05-19T13:42:42.241301+00:00 · methodology


