pith. machine review for the scientific record. sign in

arxiv: 2603.12612 · v2 · submitted 2026-03-13 · 💻 cs.LG · cs.AI

Recognition: no theorem link

FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control

Authors on Pith no claims yet

Pith reviewed 2026-05-15 12:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords maximum entropy reinforcement learninghigh-dimensional controlhumanoid roboticsstochastic policiesdistributional criticexploration modulationcontinuous control
0
0 comments X

The pith

FastDSAC shows maximum entropy RL can match or beat deterministic policies in high-dimensional humanoid control

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the scaling of maximum entropy reinforcement learning to high-dimensional humanoid control, where the curse of dimensionality leads to inefficient exploration and unstable training. It introduces dimension-wise entropy modulation to redistribute exploration across action dimensions dynamically and pairs it with a continuous distributional critic that reduces overestimation bias and avoids discrete quantization errors in value estimates. On HumanoidBench and other continuous control benchmarks, the resulting stochastic policies reach state-of-the-art performance among entropy-based methods and often surpass strong deterministic baselines, including 180 percent and 350 percent gains on the Basketball and Balance Hard tasks. This indicates that targeted adjustments can make entropy-regularized stochastic policies practical for complex continuous control.

Core claim

FastDSAC unlocks maximum entropy stochastic policies for high-dimensional humanoid control by using dimension-wise entropy modulation to dynamically allocate the exploration budget and a continuous distributional critic to deliver accurate value estimates free of high-dimensional overestimation and quantization artifacts, yielding state-of-the-art results for stochastic policies that compete with or exceed deterministic baselines.

What carries the argument

Dimension-wise Entropy Modulation (DEM) that reallocates exploration across dimensions, combined with a continuous distributional critic that mitigates overestimation and discretization artifacts during value estimation.

If this is right

  • High-dimensional stochastic policies become viable alternatives to deterministic policy gradients in humanoid control settings.
  • Maximum entropy RL can achieve competitive or superior sample efficiency on challenging continuous control benchmarks.
  • Performance advantages appear on specific difficult tasks such as basketball and hard balance maintenance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The critic and modulation design may transfer to other high-dimensional RL problems that suffer from similar exploration and estimation issues.
  • Real-robot deployment could benefit from the improved adaptability that entropy-driven exploration provides once the method is stabilized.
  • Further reductions in the need for deterministic baselines may occur if the same principles are applied to related entropy-regularized algorithms.

Load-bearing premise

That dimension-wise entropy modulation and the continuous distributional critic together remove the main barriers of the curse of dimensionality and overestimation in maximum entropy RL without creating fresh instabilities or demanding extensive per-task retuning.

What would settle it

A set of training runs on the Basketball or Balance Hard tasks in which FastDSAC fails to produce the reported performance gains over deterministic baselines or shows increased instability relative to them.

Figures

Figures reproduced from arXiv: 2603.12612 by Jun Xue, Junze Wang, Shanze Wang, Wei Zhang, Xinming Zhang, Yanjun Chen.

Figure 1
Figure 1. Figure 1: Performance of FastDSAC on very high-dimensional humanoid control tasks. (a) Visualizations of the challenging Basketball and Balance Hard environments. (b, c) Evaluation curves comparing FastDSAC with the FastTD3 baseline. Curves show the mean over 5 seeds, with shaded regions spanning the min-max range across seeds. FastDSAC outperforms FastTD3, achieving final returns 1.8× and 3.5× higher on the respect… view at source ↗
Figure 2
Figure 2. Figure 2: Comparative evaluation on high-dimensional continuous control benchmarks. Learn￾ing curves across selected tasks from HumanoidBench (top three rows), IsaacLab (fourth row), and MuJoCo Playground (bottom row). FastDSAC matches or often outperforms FastTD3 and FastSAC (Standard). It performs strongly on precision-demanding (e.g., Basketball, Insert) and stability-critical (e.g., Balance Hard) tasks, while re… view at source ↗
Figure 3
Figure 3. Figure 3: DEM improves performance and sta￾bility in high-dimensional control. Learning curves across 12 tasks from three benchmarks com￾paring FastDSAC, its ablations (w/o DEM and auto-τ ), and FastTD3/FastSAC. To assess component necessity, we conducted ablation studies across 3 benchmarks on repre￾sentative tasks spanning rough-terrain locomo￾tion, dynamic whole-body coordination, object interaction, and manipula… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative compari￾son on Basketball. FastDSAC throws well and stays stable, while FastTD3 loses balance and fails. Interpretation: Emergent Strategy via Variance Offload￾ing. The agent autonomously discovers an unconventional “body-rebound” strategy, using the torso rather than the hands to redirect the ball. While counter-intuitive to human design, this behavior maximizes return by prioritizing post-thr… view at source ↗
Figure 5
Figure 5. Figure 5: Continuous critics outperform discrete C51 critics on 9 high-dimensional tasks. Curves show the mean episode re￾turn of FastDSAC compared to FastSAC (C51+DEM) and FastSAC (standard). Q3: Continuous vs. Discrete Distributional Crit￾ics. To isolate the effect of critic parameterization, we compared FastDSAC against a discrete baseline, Fast￾SAC (C51+DEM), the C51 variant of FastSAC [28] augmented with our DE… view at source ↗
Figure 6
Figure 6. Figure 6: Sim-to-real on Uni￾tree G1. Zero-shot joystick￾conditioned forward walking on the G1 robot, showing effective￾ness of FastDSAC in real-world deployments. To verify that FastDSAC transfers effectively to physical hard￾ware, we deploy FastDSAC zero-shot on a Unitree G1 humanoid, following the mature sim-to-real pipeline in [28]. Evaluated on joystick-conditioned locomotion and a complex “squat-carry￾walk-squ… view at source ↗
Figure 7
Figure 7. Figure 7: Overview of the FastDSAC architecture. (Left) Actor with DEM: The policy dy￾namically redistributes the exploration budget by modulating the base standard deviation σˆϕ(s) with weights wi (via element-wise multiplication ⊙). These weights are derived from logits l(s) and an environment-conditioned heterogeneity factor βe using a normalized Softmax. (Middle) Environment: Massively parallel environments coll… view at source ↗
Figure 8
Figure 8. Figure 8: Additional Sim-to-Real Results. Time-lapse sequences of zero-shot FastDSAC policies deployed on the Unitree G1. The robot executes backward walking under real-time joystick commands and a “squat-carry-walk-squat” motion-tracking sequence. C Supplementary Background on FastTD3 and DSAC-T C.1 FastTD3 Recipe Inherited by FastDSAC FastTD3 is presented as a simplified high-throughput TD3 recipe built from paral… view at source ↗
Figure 9
Figure 9. Figure 9: Performance comparison against concurrent baselines on 61-DoF HumanoidBench tasks. [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Wall-clock time comparison between FastDSAC and FastTD3. Evaluated on represen￾tative tasks from HumanoidBench, IsaacLab, and MuJoCo Playground. Left: absolute per-task total training time (hours). Right: relative wall-clock time normalized against FastTD3, detailing per-task ratios, the geometric-mean ratio, and the overall total-time ratio across all nine tasks. Benchmark Task Steps FastTD3 (h) FastDSAC… view at source ↗
Figure 11
Figure 11. Figure 11: Exploration weights (wi) during a Basketball episode. High weights (bright) concentrate on the left thumb and wrist, while lower weights (dark) on the legs, torso, and right arm [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Visualization of exploration weights at different temperatures. Lower temperatures (τ = 0.5) result in highly concentrated weights on specific active joints, whereas higher temperatures (τ = 10.0) lead to a nearly uniform distribution across all joints. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Support sensitivity of C51+DEM on Basketball. We vary the C51 support range while keeping the remaining training pipeline unchanged. The resulting performance differences show that discrete critics can be sensitive to support specification on this task, whereas FastDSAC avoids this additional tuning axis. K Ablation Study on Layer Normalization To isolate the effect of Layer Normalization (LayerNorm) from… view at source ↗
Figure 14
Figure 14. Figure 14: Ablation study of Layer Normalization. Top row (HumanoidBench, |A| = 61): Removing LayerNorm induces seed sensitivity, severe oscillations, and poor sample efficiency. Bottom row (IsaacLab & MuJoCo Playground, |A| < 30): FastDSAC remains highly stable and sample-efficient even without LayerNorm, indicating that the core algorithm is robust in lower￾dimensional settings. L Ablation Study on Target Entropy … view at source ↗
Figure 15
Figure 15. Figure 15: Ablation on target entropy in Basket￾ball. The standard heuristic (H = −61) forces rapid variance decay, causing high inter-seed vari￾ance and premature convergence (orange curve). FastDSAC maintains H = 0 to provide sufficient exploration capacity, which is structurally stabi￾lized by DEM. Both curves report results over 3 seeds. In standard maximum entropy RL, the target en￾tropy is typically set to H =… view at source ↗
Figure 16
Figure 16. Figure 16: Sensitivity to the DEM temperature τ . Learning curves on Hurdle, Room, and Stair under different fixed temperatures. The default setting τ = 1 remains a robust choice overall. Room favors τ = 1, Stair shows similar performance across τ ∈ {0.5, 1, 5}, and Hurdle benefits from a larger temperature, with τ = 10 achieving the best final return. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Full results across all benchmarks. Curves show mean return over training steps with shaded regions indicating the min-max range across 3 seeds. (a) HumanoidBench (25 tasks): FastD￾SAC (blue) is compared against FastTD3 (black), FastSAC Standard (red), and leading baselines including DreamerV3 (orange), TD-MPC2 (pink), SAC (purple), and PPO (green). FastDSAC often outperforms or matches these baselines ac… view at source ↗
read the original abstract

Scaling Maximum Entropy Reinforcement Learning (RL) to high-dimensional humanoid control remains a fundamental challenge, as the ''curse of dimensionality'' induces severe exploration inefficiency and training instability. Consequently, highly optimized deterministic policy gradients currently dominate high-throughput regimes. We address this limitation with FastDSAC, a framework that effectively unlocks the potential of maximum entropy stochastic policies for complex continuous control. We introduce Dimension-wise Entropy Modulation (DEM) to dynamically redistribute the exploration budget, alongside a continuous distributional critic tailored to ensure accurate value estimation by mitigating both high-dimensional overestimation and discrete quantization artifacts. Extensive evaluations on HumanoidBench and a diverse set of continuous control tasks demonstrate that FastDSAC establishes state-of-the-art performance for high-dimensional stochastic policies on the evaluated benchmarks. Our method is competitive with and often outperforms strong deterministic baselines, with gains of 180% and 350% on the challenging Basketball and Balance Hard tasks, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces FastDSAC, a framework for scaling maximum-entropy RL to high-dimensional humanoid control. It proposes Dimension-wise Entropy Modulation (DEM) to dynamically allocate exploration and a continuous distributional critic to reduce overestimation and quantization errors. Evaluations on HumanoidBench and other continuous-control benchmarks claim state-of-the-art results among stochastic policies, with the method being competitive with or outperforming strong deterministic baselines (gains of 180% on Basketball and 350% on Balance Hard).

Significance. If the reported gains prove robust, the work would be significant because it supplies concrete evidence that maximum-entropy stochastic policies can match or exceed deterministic policy gradients on challenging high-dimensional tasks, potentially reopening exploration of entropy-regularized methods for robotics where robustness to uncertainty is valuable.

major comments (1)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central SOTA and percentage-gain claims (180% Basketball, 350% Balance Hard) are presented without accompanying ablation studies that isolate the contribution of DEM versus the distributional critic, without reported standard deviations or statistical tests across the stated multiple seeds, and without implementation-level details (network architectures, hyper-parameter schedules, or code release). These omissions make the empirical support for the headline claims unverifiable from the supplied text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical verification. We address the major comment below and will revise the manuscript accordingly to improve verifiability.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central SOTA and percentage-gain claims (180% Basketball, 350% Balance Hard) are presented without accompanying ablation studies that isolate the contribution of DEM versus the distributional critic, without reported standard deviations or statistical tests across the stated multiple seeds, and without implementation-level details (network architectures, hyper-parameter schedules, or code release). These omissions make the empirical support for the headline claims unverifiable from the supplied text.

    Authors: We agree that the current presentation lacks sufficient detail to allow full verification. In the revised version we will add ablation studies that separately disable DEM and the continuous distributional critic to quantify their individual contributions. All reported results will include mean performance plus standard deviations over at least five independent random seeds, together with appropriate statistical significance tests. Network architectures, hyper-parameter schedules, and training details will be expanded in a new appendix section. We will also release the full implementation code upon acceptance. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents FastDSAC as an empirical framework introducing Dimension-wise Entropy Modulation (DEM) and a continuous distributional critic to address exploration and overestimation issues in high-dimensional maximum-entropy RL. No equations, derivations, or load-bearing steps are described that reduce by construction to fitted inputs, self-definitions, or self-citation chains; the central claims rest on explicit update rules and benchmark results (HumanoidBench, multiple seeds) that remain externally falsifiable. The performance gains are reported as experimental outcomes rather than forced by internal renaming or uniqueness theorems imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, background axioms, or new postulated entities; insufficient information to populate the ledger.

pith-pipeline@v0.9.0 · 5468 in / 1010 out tokens · 34202 ms · 2026-05-15T12:14:44.813148+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 4 internal anchors

  1. [1]

    Td-mpc2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations, 2024

  2. [2]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  3. [3]

    Hu- manoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation

    Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Hu- manoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation. InRobotics: Science and Systems, 2024

  4. [4]

    Tdmpbc: Self-imitative reinforcement learning for humanoid robot control

    Zifeng Zhuang, Diyuan Shi, Runze Suo, Xiao He, Hongyin Zhang, Ting Wang, Shangke Lyu, and Donglin Wang. Tdmpbc: Self-imitative reinforcement learning for humanoid robot control. arXiv preprint arXiv:2502.17322, 2025

  5. [5]

    Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

    Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

  6. [6]

    Learning to walk in minutes using massively parallel deep reinforcement learning

    Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. InProceedings of the 5th Conference on Robot Learning, volume 164 ofProceedings of Machine Learning Research, pages 91–100. PMLR, 2022. URLhttps://proceedings.mlr.press/v164/rudin22a.html

  7. [7]

    Champion-level drone racing using deep reinforcement learning.Nature, 620(7976):982–987, 2023

    Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza. Champion-level drone racing using deep reinforcement learning.Nature, 620(7976):982–987, 2023

  8. [8]

    Parallelq-learning: Scaling off-policy reinforcement learning under massively parallel simulation

    Zechu Li, Tao Chen, Zhang-Wei Hong, Anurag Ajay, and Pulkit Agrawal. Parallelq-learning: Scaling off-policy reinforcement learning under massively parallel simulation. InInternational Conference on Machine Learning, pages 19440–19459. PMLR, 2023

  9. [9]

    Speeding up sac with massively parallel simulation

    Arth Shukla. Speeding up sac with massively parallel simulation. https://arthshukla.substack.com, Mar 2025. URL https://arthshukla.substack. com/p/speeding-up-sac-with-massively-parallel

  10. [10]

    Simplifying deep temporal difference learning

    Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=7IzeL0kflu

  11. [11]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Anto- nio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Heiden, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yijie Guo, M....

  12. [12]

    Mujoco playground,

    Kevin Zakka, Baruch Tabanpour, Qiayuan Liao, Mustafa Haiderbhai, Samuel Holt, Jing Yuan Luo, Arthur Allshire, Erik Frey, Koushil Sreenath, Lueder A Kahrs, et al. Mujoco playground. arXiv preprint arXiv:2502.08844, 2025

  13. [13]

    FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

    Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra, Takuma Seno, Sehee Min, Daniel Palenicek, Florian V ogt, Danica Kragic, Jan Peters, Jaegul Choo, and Hojoon Lee. Flashsac: Fast and stable off-policy reinforcement learning for high-dimensional robot control.arXiv preprint arXiv:2604.04539, 2026

  14. [14]

    Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018

  15. [15]

    The curse of dimensionality

    Mario Köppen. The curse of dimensionality. In5th Online World Conference on Soft Computing in Industrial Applications (WSC5), volume 1, pages 4–8, 2000

  16. [16]

    On high-dimensional action selection for deep reinforcement learning, 2024

    Wenbo Zhang and Hengrui Cai. On high-dimensional action selection for deep reinforcement learning, 2024. URLhttps://openreview.net/forum?id=rto6aU453A

  17. [17]

    Pierre Schumacher, Thomas Geijtenbeek, Vittorio Caggiano, Vikash Kumar, Syn Schmitt, Georg Martius, and Daniel F. B. Haeufle. Natural and robust walking using reinforcement learning without demonstrations in high-dimensional musculoskeletal models, 2023. URL https://arxiv.org/abs/2309.02976

  18. [18]

    Scalable exploration for high-dimensional continuous control via value-guided flow, 2026

    Yunyue Wei, Chenhui Zuo, and Yanan Sui. Scalable exploration for high-dimensional continuous control via value-guided flow, 2026. URLhttps://arxiv.org/abs/2601.19707

  19. [19]

    Entropy regularizing activation: Boosting continuous control, large language models, and image classification with activation as entropy constraints

    Zilin Kang, Chonghua Liao, Tingqiang Xu, and Huazhe Xu. Entropy regularizing activation: Boosting continuous control, large language models, and image classification with activation as entropy constraints. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Cqdsw3yteP

  20. [20]

    Aditya Bhatt, Max Argus, Artemij Amiranashvili, and T. Brox. Crossq: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity.ArXiv, abs/1902.05605, 2019

  21. [21]

    Vlearn: Off-policy learning with efficient state-value function estimation, 2024

    Gerhard Neumann, Fabian Otto, Philipp Becker, and Ngo Anh Vien. Vlearn: Off-policy learning with efficient state-value function estimation, 2024. URL https://arxiv.org/abs/2403. 04453

  22. [22]

    Coarse-to-fine q-network with action sequence for data- efficient reinforcement learning, 2025

    Younggyo Seo and Pieter Abbeel. Coarse-to-fine q-network with action sequence for data- efficient reinforcement learning, 2025. URLhttps://arxiv.org/abs/2411.12155

  23. [23]

    Controlling overestimation bias with truncated mixture of continuous distributional quantile critics, 2020

    Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics, 2020. URLhttps://arxiv.org/abs/2005.04269

  24. [24]

    Li, Yangang Ren, Qi Sun, and B

    Jingliang Duan, Yang Guan, S. Li, Yangang Ren, Qi Sun, and B. Cheng. Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors.IEEE Transactions on Neural Networks and Learning Systems, 33:6584–6598, 2020

  25. [25]

    Simplicial embeddings improve sample efficiency in actor-critic agents,

    Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, and Pablo Samuel Castro. Simplicial embeddings improve sample efficiency in actor-critic agents,

  26. [26]

    URLhttps://arxiv.org/abs/2510.13704

  27. [27]

    Distributional soft actor-critic with three refinements

    Jingliang Duan, Wenxuan Wang, Liming Xiao, Jiaxin Gao, Shengbo Eben Li, Chang Liu, Ya-Qin Zhang, Bo Cheng, and Keqiang Li. Distributional soft actor-critic with three refinements. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3935–3946, 2025. doi: 10.1109/TPAMI.2025.3537087

  28. [28]

    A Distributional Perspective on Reinforcement Learning

    Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforce- ment learning, 2017. URLhttps://arxiv.org/abs/1707.06887. 11

  29. [29]

    Learning sim-to-real humanoid locomotion in 15 minutes, 2025

    Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan, and Pieter Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes, 2025. URL https://arxiv.org/ abs/2512.01996

  30. [30]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, 2023

  31. [31]

    Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control

    Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miło´s, and Marek Cygan. Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control. In Advances in Neural Information Processing Systems, 2024. 12 APPENDIXCONTENTS AImplementation Details and Hyperparameters 14 A.1 High-Throughput Training Protocol 14 A.2 Uni...

  32. [32]

    Consistent with recent findings [29, 30, 28], this regularization is critical for maintaining stable gradient flow in expansive action spaces

    Layer Normalization (LayerNorm).We apply LayerNorm in the actor and critic networks exclusively for the ultra-high-dimensional HumanoidBench domain ( |A|= 61 ). Consistent with recent findings [29, 30, 28], this regularization is critical for maintaining stable gradient flow in expansive action spaces. Conversely, for the MuJoCo Playground and IsaacLab ta...

  33. [33]

    vanishing exploration

    Target Entropy.Following recent work [ 28], we set the target entropy H= 0 across all tasks. In high-dimensional spaces, the standard heuristic (H=−|A| ) enforces aggressive variance decay, forcing the policy to prematurely drop its exploration budget and exacerbating the “vanishing exploration” problem. By setting H= 0 , we explicitly maintain a generous...

  34. [34]

    upper bound

    Adaptive Temperature Optimization (auto-τ).The unified default DEM temperature ( τ= 1.0 ) yields strong and robust performance across the vast majority of evaluated tasks and remains our primary recommendation for general use. However, a small subset of highly dynamic tasks requiring extreme whole-body coordination (e.g.,HurdleandBalance Hard) intrinsical...

  35. [35]

    High-Precision Critic Stability.Unlike the original DSAC-T [ 26], which requires a large numerical stabilizer (bias = 0.1) to prevent division-by-zero during continuous distribution modeling, the inherently stable statistics derived from our large-batch, high-throughput regime allow us to reduce this bias significantly to 10−6. This crucial reduction enab...

  36. [36]

    squat-carry-walk-squat

    Implementation Notes.In implementation, the temperature-scaled DEM logits are clipped to a bounded range before the Softmax for numerical stability, and the base log-standard deviation is mapped to fixed bounds via a tanh parameterization. For heterogeneous exploration, the environment- specific scaling factor is kept fixed within each episode and resampl...

  37. [37]

    Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...