RANDPOL: Parameter-Efficient End-to-End Quadruped Locomotion via Randomized Policy Learning

Quan Nguyen; Rahul Jain; Zhuochen Liu

arxiv: 2505.19054 · v2 · submitted 2025-05-25 · 💻 cs.LG

Pith reviewed 2026-05-19 13:42 UTC · model grok-4.3

keywords: quadruped locomotion · parameter-efficient control · randomized policy learning · sim-to-real transfer · reinforcement learning · Unitree Go2

The pith

RANDPOL achieves comparable quadruped locomotion by training only the final linear layer while fixing random hidden features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether end-to-end quadruped locomotion control requires fully trainable deep networks with many parameters. It introduces RANDPOL, which randomly initializes and freezes the hidden layers of the actor and critic, training solely the output readout. This design draws on randomized function approximation to preserve nonlinear capability while shrinking the trainable space. Experiments on the Unitree Go2 show performance close to standard PPO alongside reduced parameters, faster per-iteration training, and direct sim-to-real transfer under velocity and yaw commands. The results indicate that structured control tasks can succeed without optimizing the entire network.

Core claim

RANDPOL attains comparative locomotion performance with far fewer trainable parameters, lower learning-phase computation time per iteration, and a favorable performance-complexity trade-off. We further demonstrate successful zero-shot sim-to-real transfer of the learned RANDPOL controller on the physical Unitree Go2 under user-issued forward-velocity and yaw-rate commands.

What carries the argument

Fixed random nonlinear features in the hidden layers of the actor and critic networks, which supply an expressive random basis so that only the final linear readout needs training.
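
The mechanism can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the class name `RandomFeaturePolicy`, the `tanh` activation, the use of two hidden layers, and the squared-error update are all assumptions made for the example (the paper trains the readout with PPO). The Gaussian initialization with variance 1/(input dimension) and zero biases follows the simulated rebuttal further down this page.

```python
import math
import random


def make_frozen_layer(rng, n_in, n_out):
    """Sample one hidden layer: weights ~ N(0, 1/n_in), biases set to zero."""
    std = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(weights, biases, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


class RandomFeaturePolicy:
    """Frozen random hidden layers followed by a trainable linear readout."""

    def __init__(self, obs_dim, act_dim, hidden=(256, 256), seed=0):
        rng = random.Random(seed)
        sizes = (obs_dim,) + tuple(hidden)
        # Hidden layers are sampled once here and never modified afterwards.
        self.frozen = [make_frozen_layer(rng, a, b)
                       for a, b in zip(sizes[:-1], sizes[1:])]
        # The readout is the only trainable part; it starts at zero.
        self.readout_w = [[0.0] * hidden[-1] for _ in range(act_dim)]
        self.readout_b = [0.0] * act_dim

    def features(self, obs):
        """Map an observation through the fixed random nonlinear basis."""
        h = list(obs)
        for weights, biases in self.frozen:
            h = [math.tanh(z) for z in affine(weights, biases, h)]
        return h

    def act(self, obs):
        """Linear readout on top of the frozen features."""
        return affine(self.readout_w, self.readout_b, self.features(obs))

    def readout_sgd_step(self, obs, target, lr=0.01):
        """One gradient step on 0.5 * ||act(obs) - target||^2.

        Illustrative stand-in for the PPO update: only the readout changes.
        """
        phi = self.features(obs)
        pred = affine(self.readout_w, self.readout_b, phi)
        for i, (p, t) in enumerate(zip(pred, target)):
            err = p - t
            for j, f in enumerate(phi):
                self.readout_w[i][j] -= lr * err * f
            self.readout_b[i] -= lr * err
```

Because `features` never touches a trainable parameter, its output could in principle be cached per observation, and the gradient computation involves only the readout matrix and bias.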

If this is right

Training time and memory per iteration drop because gradients are computed only for the readout weights.
The same fixed-random architecture supports zero-shot transfer from simulation to the physical Unitree Go2 robot.
Reducing trainable complexity remains compatible with effective simulated and real-world quadruped performance.
The approach offers a favorable performance-complexity trade-off relative to fully trainable PPO baselines.
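
To make the size of the reduction concrete, the following sketch counts parameters for a fully trainable MLP versus a readout-only one. The helper `mlp_param_counts` is hypothetical, and the observation dimension (48), action dimension (12), and the choice of two hidden layers are assumptions for illustration, not figures from the paper; only the width of 256 comes from the simulated rebuttal on this page. Any learnable action-noise parameter is left out of the count.

```python
def mlp_param_counts(obs_dim, act_dim, hidden=(256, 256)):
    """Count weights and biases per layer of a plain MLP.

    Returns the total for a fully trainable network and the count for
    the final linear layer alone (the only part RANDPOL-style training
    would update).
    """
    sizes = (obs_dim,) + tuple(hidden) + (act_dim,)
    per_layer = [(n_in + 1) * n_out
                 for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    return {"total": sum(per_layer), "readout_only": per_layer[-1]}


# Hypothetical sizes, chosen only to illustrate the arithmetic.
counts = mlp_param_counts(obs_dim=48, act_dim=12)
fraction = counts["readout_only"] / counts["total"]
print(counts, f"trainable fraction = {fraction:.3f}")
```

Under these assumed sizes the readout holds 3,084 of 81,420 parameters, under 4% of the network. The actual ratio depends on the paper's architecture, which the referee report notes is not fully specified.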

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar fixed-random readouts could be tested on other structured tasks such as manipulation or navigation where full network training is costly.
Hardware with limited on-device training might run RANDPOL-style controllers after offline readout optimization.
The method invites direct comparison with reservoir-computing or extreme-learning-machine variants on the same locomotion benchmarks.

Load-bearing premise

Fixed random nonlinear features supply enough expressiveness for structured quadruped locomotion without any training of the hidden layers.

What would settle it

A direct comparison showing that RANDPOL cannot match PPO on key locomotion metrics such as forward speed, stability, or energy efficiency even after scaling the number of random features.

Figures

Figures reproduced from arXiv: 2505.19054 by Quan Nguyen, Rahul Jain, Zhuochen Liu.

**Figure 2.** Training curves on the Go2 forward-and-yaw velocity-tracking task. Each plot shows the mean across five runs

**Figure 3.** Representative real-world evaluation conditions for

**Figure 4.** Representative hardware yaw-rate tracking trajectories

Original abstract

Modern learning-based locomotion controllers typically rely on fully trainable deep neural networks with a large number of parameters. This paper studies a different design point for end-to-end control: whether effective quadruped locomotion can be achieved with a drastically reduced trainable parameter space. We present RANDomized POlicy Learning (RANDPOL), a policy learning approach in which the hidden layers of the actor and critic are randomly initialized and fixed, while only the final linear readout is trained. This yields a parameter-efficient controller class that retains nonlinear expressiveness through a fixed random basis while substantially reducing the dimension of the optimization problem. RANDPOL is supported by the mathematical foundation of randomized function approximation, which provides a principled basis for using fixed random nonlinear features as expressive function classes. We evaluate RANDPOL on end-to-end locomotion control for the Unitree Go2 quadruped and compare it with Proximal Policy Optimization (PPO). The results show that RANDPOL attains comparative locomotion performance with far fewer trainable parameters, lower learning-phase computation time per iteration, and a favorable performance-complexity trade-off. We further demonstrate successful zero-shot sim-to-real transfer of the learned RANDPOL controller on the physical Unitree Go2 under user-issued forward-velocity and yaw-rate commands. These results indicate that, for structured robotic control problems, reducing trainable complexity can remain compatible with effective simulated and real-world performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces RANDPOL, a policy learning method for end-to-end quadruped locomotion in which the hidden layers of the actor and critic are randomly initialized and fixed while only the final linear readout is trained. It claims that this yields locomotion performance comparable to standard PPO on the Unitree Go2, with substantially fewer trainable parameters, lower per-iteration computation during learning, and a favorable performance-complexity trade-off, together with successful zero-shot sim-to-real transfer under forward-velocity and yaw-rate commands. The approach is justified by randomized function approximation theory.

Significance. If the empirical claims hold under rigorous verification, the work would be significant for showing that fixed random nonlinear features can supply adequate expressiveness for high-dimensional structured control tasks such as quadruped locomotion. This could reduce training cost and memory footprint for RL-based controllers while preserving sim-to-real capability, and it provides a concrete test of randomized approximation ideas in a robotics setting.

major comments (2)

[Method] Method section: the paper does not specify the hidden-layer width, the distribution used to draw the random weights and biases, or any kernel-alignment or approximation-quality diagnostic. These details are load-bearing for the central claim that fixed random features provide a sufficiently rich basis for both actor and critic without hidden-layer training.
[Experiments] Experiments and Results sections: the claim of 'comparative locomotion performance' with PPO is stated without quantitative metrics (e.g., velocity tracking error, success rate, or return), statistical details across seeds, or explicit description of how the PPO baseline network size and training budget were matched. This leaves the performance-complexity trade-off assertion with limited verifiable support.

minor comments (1)

[Abstract] Abstract: the reductions in trainable parameters and per-iteration compute time are described qualitatively ('far fewer', 'lower') but not quantified, which would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the planned revisions.

Referee: [Method] Method section: the paper does not specify the hidden-layer width, the distribution used to draw the random weights and biases, or any kernel-alignment or approximation-quality diagnostic. These details are load-bearing for the central claim that fixed random features provide a sufficiently rich basis for both actor and critic without hidden-layer training.

Authors: We agree that these implementation details are necessary to substantiate the claim that fixed random features suffice for the task. In the revised manuscript we will explicitly report the hidden-layer width (256 units) and the weight initialization distribution (zero-mean Gaussian with variance equal to 1 divided by the layer's input dimension, with biases initialized to zero), and we will add a short discussion of the randomized function approximation theory that underpins the expressiveness of the fixed basis for both actor and critic. revision: yes

Referee: [Experiments] Experiments and Results sections: the claim of 'comparative locomotion performance' with PPO is stated without quantitative metrics (e.g., velocity tracking error, success rate, or return), statistical details across seeds, or explicit description of how the PPO baseline network size and training budget were matched. This leaves the performance-complexity trade-off assertion with limited verifiable support.

Authors: We acknowledge that additional quantitative detail would strengthen the presentation. The revised Experiments and Results sections will include explicit metrics (velocity tracking error, success rate, and return), report means and standard deviations across five independent seeds, and clarify that the PPO baseline uses the same hidden-layer dimensions with all parameters trainable while matching the total number of environment steps. revision: yes
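
The kind of reporting promised here can be sketched as follows. This is an illustrative helper, not the authors' evaluation code: the function names are hypothetical, root-mean-square error is one assumed choice of velocity tracking metric, and the sample standard deviation (n - 1 denominator) is an assumed convention for summarizing across seeds.

```python
import math
import statistics


def tracking_rmse(commanded, measured):
    """Root-mean-square error between commanded and measured velocities."""
    if len(commanded) != len(measured) or not commanded:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(c - m) ** 2 for c, m in zip(commanded, measured)]
    return math.sqrt(sum(squared) / len(squared))


def summarize_across_seeds(per_seed_values):
    """Return (mean, sample standard deviation) of one metric over seeds."""
    return statistics.mean(per_seed_values), statistics.stdev(per_seed_values)
```

Each seed would contribute one `tracking_rmse` value per command channel (forward velocity, yaw rate), and `summarize_across_seeds` would then be applied to the five per-seed values for RANDPOL and for the PPO baseline separately.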

Circularity Check

0 steps flagged

No circularity: external randomized approximation theory plus empirical validation

The paper's core design, fixing random hidden layers and training only the final linear readout, is justified by reference to the established external theory of randomized function approximation rather than by any internal derivation or fit. Performance claims rest on direct empirical comparison to PPO, including sim-to-real transfer metrics on the Unitree Go2, without renaming fitted quantities as predictions or invoking self-citations whose content reduces to the present work. The argument therefore rests on independent mathematical results and experimental benchmarks rather than on its own conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the external theory of randomized function approximation to guarantee that fixed random hidden layers remain expressive for locomotion; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: Randomized function approximation supplies a principled basis for using fixed random nonlinear features as expressive function classes for control policies.
Invoked in the abstract to underwrite the decision to freeze hidden layers.
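
For orientation, the type of guarantee this assumption points to can be written out as below. The notation ($\phi$, $p$, $C$, $K$, $\delta$, $\mu$) is introduced here for illustration, and the bound is recalled from Rahimi and Recht's "Weighted sums of random kitchen sinks" rather than taken from the RANDPOL paper, so its constants and conditions should be checked against that source. It covers a single layer of random features; whether and how it extends to the multi-layer frozen networks used for the actor and critic is not established by this sketch.

```latex
% Random-feature class: bounded features |\phi(x;\omega)| \le 1, with
% parameters \omega_k drawn i.i.d. from a fixed density p and then frozen.
\hat f(x) = \sum_{k=1}^{K} \alpha_k \, \phi(x;\omega_k),
\qquad \omega_k \sim p .

% Target class: f(x) = \int \alpha(\omega)\,\phi(x;\omega)\,d\omega
% with |\alpha(\omega)| \le C\,p(\omega).
% With probability at least 1-\delta over the draw of \omega_1,\dots,\omega_K,
% there exist readout weights \alpha_1,\dots,\alpha_K such that
\bigl\| \hat f - f \bigr\|_{L^2(\mu)}
\;\le\; \frac{C}{\sqrt{K}} \left( 1 + \sqrt{2 \log \tfrac{1}{\delta}} \right).
```

Read this way, the number of random features $K$ plays the role of the hidden-layer width: the approximation error of the best readout shrinks at a rate of order $1/\sqrt{K}$, which is why scaling the number of features is the natural stress test of the premise.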

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "RANDPOL samples the hidden layers once, keeps them fixed throughout training, and optimizes only the final linear layer... supported by the mathematical foundation of randomized function approximation"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "random basis functions generated by a frozen randomized neural network"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 4 internal anchors

[1] Mnih, Volodymyr, et al., Human-level control through deep reinforcement learning, Nature, 518.7540 (2015): 529-533
[2] Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P., Trust region policy optimization, International Conference on Machine Learning, pp. 1889-1897. PMLR, 2015
[3] Mnih, Volodymyr, et al., Asynchronous methods for deep reinforcement learning, International Conference on Machine Learning, PMLR, 2016
[4] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S., Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, International Conference on Machine Learning, pp. 1861-1870. PMLR, 2018
[5] Chou, Po-Wei, Daniel Maturana, and Sebastian Scherer, Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution, International Conference on Machine Learning, pp. 834-843. PMLR, 2017
[6] Silver, David, et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529.7587 (2016): 484-489
[7] Silver, David, et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science, 362.6419 (2018): 1140-1144
[8] Bellegarda, G., Chen, Y., Liu, Z., & Nguyen, Q., Robust high-speed running for quadruped robots via deep reinforcement learning, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 10364-10370. IEEE
[9] Bellegarda, Guillaume, and Auke Ijspeert, CPG-RL: Learning central pattern generators for quadruped locomotion, IEEE Robotics and Automation Letters, 2022, pp. 12547-12554
[10] Kumar, A., Fu, Z., Pathak, D., & Malik, J., RMA: Rapid motor adaptation for legged robots, arXiv preprint, 2021, arXiv:2107.04034
[11] Siekmann, J., Green, K., Warila, J., Fern, A., & Hurst, J., Blind bipedal stair traversal via sim-to-real reinforcement learning, arXiv preprint, 2021, arXiv:2105.08328
[12] Rahimi, Ali, and Benjamin Recht, Uniform approximation of functions with random bases, Allerton Conference on Communication, Control, and Computing, vol. 46, 2008, pp. 265-271. IEEE
[13] Rahimi, Ali, and Benjamin Recht, Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, Advances in Neural Information Processing Systems, vol. 21, 2008
[14] Gallicchio, Claudio, and Simone Scardapane, Deep randomized neural networks, Recent Trends in Learning From Data: Tutorials from the INNS Big Data and Deep Learning Conference (INNSBDDL2019), 2020, pp. 43-68, Cham: Springer International Publishing
[15] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O., Proximal policy optimization algorithms, arXiv preprint, 2017, arXiv:1707.06347
[16] Mittal, Mayank, et al., Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning, arXiv preprint, 2025, arXiv:2511.04831
[17] Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P., High-dimensional continuous control using generalized advantage estimation, arXiv preprint, 2015, arXiv:1506.02438
[18] Gallicchio, Claudio, Alessio Micheli, and Luca Pedrelli, Deep reservoir computing: A critical experimental analysis, Neurocomputing, 268, 2017, pp. 87-99
[19] Huang, Guang-Bin, Qin-Yu Zhu, and Chee-Kheong Siew, Extreme learning machine: theory and applications, Neurocomputing, 70.1-3, 2006, pp. 489-501
[20] Suganthan, Ponnuthurai N., and Rakesh Katuwal, On the origins of randomization-based feedforward neural networks, Applied Soft Computing, 105, 2021, 107239
[21] Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., & Hutter, M., Learning agile and dynamic motor skills for legged robots, Science Robotics, 2019, eaau5872
[22] Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., & Hutter, M., Learning quadrupedal locomotion over challenging terrain, Science Robotics, 2020, eabc5986
[23] Rudin, N., Hoeller, D., Reist, P., & Hutter, M., Learning to walk in minutes using massively parallel deep reinforcement learning, Conference on Robot Learning, 2022, pp. 91-100. PMLR
[24] Hoeller, D., Rudin, N., Sako, D., & Hutter, M., ANYmal parkour: Learning agile navigation for quadrupedal robots, Science Robotics, 9.88, 2024, eadi7566
[25] Schwarke, C., Mittal, M., Rudin, N., Hoeller, D., & Hutter, M., RSL-RL: A learning library for robotics research, arXiv preprint, 2025, arXiv:2509.10771
