RANDPOL: Parameter-Efficient End-to-End Quadruped Locomotion via Randomized Policy Learning
Pith reviewed 2026-05-19 13:42 UTC · model grok-4.3
The pith
RANDPOL achieves comparable quadruped locomotion by training only the final linear layer while fixing random hidden features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RANDPOL attains comparative locomotion performance with far fewer trainable parameters, lower learning-phase computation time per iteration, and a favorable performance-complexity trade-off. We further demonstrate successful zero-shot sim-to-real transfer of the learned RANDPOL controller on the physical Unitree Go2 under user-issued forward-velocity and yaw-rate commands.
What carries the argument
Fixed random nonlinear features in the hidden layers of the actor and critic networks, which supply an expressive random basis so that only the final linear readout needs training.
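As a rough illustration of this mechanism, the sketch below builds hidden layers that are sampled once and never updated, plus a linear readout that holds the only trainable parameters. It is not the authors' implementation: the observation size (48), action size (12), width (256), depth (2), the tanh activation, and every function name are assumptions chosen for illustration.

```python
import numpy as np

def make_frozen_features(rng, obs_dim, width, depth=2):
    """Sample hidden-layer weights once; they are never updated afterwards."""
    layers, in_dim = [], obs_dim
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, width))
        b = np.zeros(width)
        layers.append((W, b))
        in_dim = width
    return layers

def features(layers, obs):
    """Map observations through the fixed random nonlinear layers."""
    h = obs
    for W, b in layers:
        h = np.tanh(h @ W + b)  # tanh is an assumed activation
    return h

def policy_mean(layers, readout, obs):
    """Linear readout on top of the frozen features."""
    return features(layers, obs) @ readout

rng = np.random.default_rng(0)
obs_dim, act_dim, width = 48, 12, 256      # hypothetical sizes
layers = make_frozen_features(rng, obs_dim, width)
readout = np.zeros((width, act_dim))       # the only trainable parameters

obs = rng.normal(size=(4, obs_dim))        # a dummy batch of observations
actions = policy_mean(layers, readout, obs)

frozen = sum(W.size + b.size for W, b in layers)
trainable = readout.size
print(f"frozen parameters: {frozen}, trainable parameters: {trainable}")
```

Under these assumed sizes the readout holds 3,072 weights while the frozen layers hold 78,336, which is the kind of gap in trainable dimension the summary describes.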
If this is right
- Training time and memory per iteration drop because gradients are computed only for the readout weights.
- The same fixed-random architecture supports zero-shot transfer from simulation to the physical Unitree Go2 robot.
- Reducing trainable complexity remains compatible with effective simulated and real-world quadruped performance.
- The approach offers a favorable performance-complexity trade-off relative to fully trainable PPO baselines.
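The first implication can be made concrete with a toy example in which the feature matrix is a constant and the gradient is taken with respect to the readout alone. This is a sketch under stated assumptions: a least-squares objective stands in for the PPO-style objective the paper actually uses, and the data, sizes, and bias distribution are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, in_dim, width, out_dim = 512, 8, 128, 2   # hypothetical sizes

# Frozen hidden layer: sampled once, excluded from optimization.
W_hidden = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, width))
b_hidden = rng.normal(0.0, 1.0, size=width)

# Toy inputs and targets (stand-ins for observations and action targets).
X = rng.normal(size=(n, in_dim))
Y = np.stack([np.sin(X[:, 0]), np.cos(X[:, 1])], axis=1)

# From the optimizer's point of view the features are a constant matrix.
Phi = np.tanh(X @ W_hidden + b_hidden)

def loss(readout):
    """Half the mean squared residual norm per sample."""
    return 0.5 * float(np.sum((Phi @ readout - Y) ** 2)) / n

readout = np.zeros((width, out_dim))
loss_before = loss(readout)

# |tanh| <= 1, so the largest Hessian eigenvalue is at most `width`;
# a step size of 1/width therefore cannot diverge.
lr = 1.0 / width
for _ in range(200):
    grad = Phi.T @ (Phi @ readout - Y) / n   # gradient w.r.t. the readout only
    readout -= lr * grad

loss_after = loss(readout)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

No gradient is ever propagated into `W_hidden` or `b_hidden`, so the backward pass reduces to a single matrix product per step.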
Where Pith is reading between the lines
- Similar fixed-random readouts could be tested on other structured tasks such as manipulation or navigation where full network training is costly.
- Hardware with limited on-device training might run RANDPOL-style controllers after offline readout optimization.
- The method invites direct comparison with reservoir-computing or extreme-learning-machine variants on the same locomotion benchmarks.
Load-bearing premise
Fixed random nonlinear features supply enough expressiveness for structured quadruped locomotion without any training of the hidden layers.
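For a single hidden layer, the function class behind this premise can be written as follows. The symbols are notation introduced here for illustration, not taken from the paper: N is the number of random features, \phi is the nonlinearity, p is the sampling distribution, and \theta is the trainable readout.

```latex
f_{\theta}(x) \;=\; \sum_{j=1}^{N} \theta_j \,\phi\!\left(w_j^{\top} x + b_j\right),
\qquad (w_j, b_j) \sim p \ \text{ sampled once and then held fixed.}
```

Only \theta is optimized. Classical random-feature results are usually stated as approximation error shrinking on the order of 1/\sqrt{N} for targets in a suitable function class; whether locomotion policies and value functions fall in such a class at practical N is the empirical question this premise rests on.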
What would settle it
A direct comparison showing that RANDPOL cannot match PPO on key locomotion metrics such as forward speed, stability, or energy efficiency even after scaling the number of random features.
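The "scaling the number of random features" part of this test can be prototyped on a toy regression problem before running it on locomotion. The sketch below is only an analogy: the target function, sizes, and sampling distributions are assumptions, and training error on a synthetic target says nothing about forward speed, stability, or energy efficiency on the robot.

```python
import numpy as np

rng = np.random.default_rng(2)
n, in_dim, max_width = 400, 6, 256   # hypothetical sizes

X = rng.normal(size=(n, in_dim))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.5 * X[:, 2]   # toy target

# Sample the largest feature bank once; smaller models use a prefix of it,
# so their feature sets are nested.
W = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, max_width))
b = rng.uniform(-np.pi, np.pi, size=max_width)
Phi_full = np.tanh(X @ W + b)

def fit_error(num_features):
    """Least-squares readout on the first `num_features` random features."""
    Phi = Phi_full[:, :num_features]
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.mean((Phi @ theta - y) ** 2))

widths = [8, 32, 128, 256]
errors = [fit_error(k) for k in widths]
for k, e in zip(widths, errors):
    print(f"{k:4d} features -> training MSE {e:.4f}")
```

Because the feature sets are nested, the training error cannot increase with width in exact arithmetic. The informative outcome for RANDPOL would be the opposite pattern on task metrics: a gap to PPO that persists as the feature count grows.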
Original abstract
Modern learning-based locomotion controllers typically rely on fully trainable deep neural networks with a large number of parameters. This paper studies a different design point for end-to-end control: whether effective quadruped locomotion can be achieved with a drastically reduced trainable parameter space. We present RANDomized POlicy Learning (RANDPOL), a policy learning approach in which the hidden layers of the actor and critic are randomly initialized and fixed, while only the final linear readout is trained. This yields a parameter-efficient controller class that retains nonlinear expressiveness through a fixed random basis while substantially reducing the dimension of the optimization problem. RANDPOL is supported by the mathematical foundation of randomized function approximation, which provides a principled basis for using fixed random nonlinear features as expressive function classes. We evaluate RANDPOL on end-to-end locomotion control for the Unitree Go2 quadruped and compare it with Proximal Policy Optimization (PPO). The results show that RANDPOL attains comparative locomotion performance with far fewer trainable parameters, lower learning-phase computation time per iteration, and a favorable performance-complexity trade-off. We further demonstrate successful zero-shot sim-to-real transfer of the learned RANDPOL controller on the physical Unitree Go2 under user-issued forward-velocity and yaw-rate commands. These results indicate that, for structured robotic control problems, reducing trainable complexity can remain compatible with effective simulated and real-world performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RANDPOL, a policy learning method for end-to-end quadruped locomotion in which the hidden layers of the actor and critic are randomly initialized and fixed while only the final linear readout is trained. It claims that this yields locomotion performance comparable to standard PPO on the Unitree Go2, with substantially fewer trainable parameters, lower per-iteration computation during learning, and a favorable performance-complexity trade-off, together with successful zero-shot sim-to-real transfer under forward-velocity and yaw-rate commands. The approach is justified by randomized function approximation theory.
Significance. If the empirical claims hold under rigorous verification, the work would be significant for showing that fixed random nonlinear features can supply adequate expressiveness for high-dimensional structured control tasks such as quadruped locomotion. This could reduce training cost and memory footprint for RL-based controllers while preserving sim-to-real capability, and it provides a concrete test of randomized approximation ideas in a robotics setting.
Major comments (2)
- [Method] Method section: the paper does not specify the hidden-layer width, the distribution used to draw the random weights and biases, or any kernel-alignment or approximation-quality diagnostic. These details are load-bearing for the central claim that fixed random features provide a sufficiently rich basis for both actor and critic without hidden-layer training.
- [Experiments] Experiments and Results sections: the claim of 'comparative locomotion performance' with PPO is stated without quantitative metrics (e.g., velocity tracking error, success rate, or return), statistical details across seeds, or explicit description of how the PPO baseline network size and training budget were matched. This leaves the performance-complexity trade-off assertion with limited verifiable support.
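One possible form of the approximation-quality diagnostic requested in the first major comment is kernel-target alignment between the Gram matrix of the frozen features and a target signal. The sketch below is an assumption about what such a diagnostic could look like, not something the paper or the referee specifies; the data, sizes, and function name are hypothetical.

```python
import numpy as np

def kernel_target_alignment(Phi, y):
    """Uncentered alignment between K = Phi Phi^T and the target kernel y y^T.

    Both matrices are positive semidefinite, so the value lies in [0, 1];
    higher means the random features are better aligned with the target.
    """
    K = Phi @ Phi.T
    yy = np.outer(y, y)
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return float(np.sum(K * yy) / (np.linalg.norm(K) * np.linalg.norm(yy)))

rng = np.random.default_rng(3)
n, in_dim, width = 200, 6, 128        # hypothetical sizes

X = rng.normal(size=(n, in_dim))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]   # toy stand-in for a value target

W = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, width))
b = np.zeros(width)
Phi = np.tanh(X @ W + b)

alignment = kernel_target_alignment(Phi, y)
print(f"kernel-target alignment: {alignment:.3f}")
```

Reporting such a number for several widths and sampling distributions would give readers a way to judge the random basis independently of the reinforcement-learning results.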
Minor comments (1)
- [Abstract] Abstract: the reductions in trainable parameters and per-iteration compute time are described qualitatively ('far fewer', 'lower') but not quantified, which would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the planned revisions.
Point-by-point responses
- Referee: [Method] Method section: the paper does not specify the hidden-layer width, the distribution used to draw the random weights and biases, or any kernel-alignment or approximation-quality diagnostic. These details are load-bearing for the central claim that fixed random features provide a sufficiently rich basis for both actor and critic without hidden-layer training.
Authors: We agree that these implementation details are necessary to substantiate the claim that fixed random features suffice for the task. In the revised manuscript we will explicitly report the hidden-layer width (256 units) and the initialization scheme (weights drawn from a zero-mean Gaussian with variance equal to 1 divided by the input dimension, biases set to zero), and we will add a short discussion of the randomized function approximation theory that underpins the expressiveness of the fixed basis for both actor and critic. revision: yes
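The initialization described in this response can be checked numerically. The sketch below assumes a hypothetical observation dimension of 48 and unit-variance Gaussian inputs; only the width (256), the zero-mean Gaussian with variance 1/(input dimension), and the zero biases come from the response itself.

```python
import numpy as np

rng = np.random.default_rng(4)
in_dim, width = 48, 256   # in_dim is a hypothetical observation size

# Zero-mean Gaussian weights with variance 1 / in_dim; biases set to zero.
W = rng.normal(loc=0.0, scale=np.sqrt(1.0 / in_dim), size=(in_dim, width))
b = np.zeros(width)

# With unit-variance inputs, this scaling keeps pre-activations near unit variance.
x = rng.normal(size=(10_000, in_dim))
pre_act = x @ W + b

weight_var = float(W.var())
pre_act_var = float(pre_act.var())
print(f"weight variance: {weight_var:.4f} (target {1.0 / in_dim:.4f})")
print(f"pre-activation variance: {pre_act_var:.3f}")
```

Keeping pre-activations near unit variance matters more than usual here, since a frozen layer that saturates or collapses at initialization cannot be repaired by training.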
- Referee: [Experiments] Experiments and Results sections: the claim of 'comparative locomotion performance' with PPO is stated without quantitative metrics (e.g., velocity tracking error, success rate, or return), statistical details across seeds, or explicit description of how the PPO baseline network size and training budget were matched. This leaves the performance-complexity trade-off assertion with limited verifiable support.
Authors: We acknowledge that additional quantitative detail would strengthen the presentation. The revised Experiments and Results sections will include explicit metrics (velocity tracking error, success rate, and return), report means and standard deviations across five independent seeds, and clarify that the PPO baseline uses the same hidden-layer dimensions with all parameters trainable while matching the total number of environment steps. revision: yes
Circularity Check
No circularity: external randomized approximation theory plus empirical validation
Full rationale
The paper's core design—fixing random hidden layers and training only the final linear readout—is justified by reference to the established external theory of randomized function approximation rather than any internal derivation or fit. Performance claims rest on direct empirical comparison to PPO, including sim-to-real transfer metrics on the Unitree Go2, without renaming fitted quantities as predictions or invoking self-citations whose content reduces to the present work. The derivation chain therefore remains self-contained against independent mathematical results and experimental benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Randomized function approximation supplies a principled basis for using fixed random nonlinear features as expressive function classes for control policies.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "RANDPOL samples the hidden layers once, keeps them fixed throughout training, and optimizes only the final linear layer... supported by the mathematical foundation of randomized function approximation"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "random basis functions generated by a frozen randomized neural network"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.