Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing
Pith reviewed 2026-05-08 17:35 UTC · model grok-4.3
The pith
Dynamic Decoupled Spherical Radial Squashing enforces hard per-joint actuator constraints in reinforcement learning by adapting each actuator's squashing radius to its current position.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that decoupling the spherical radial squashing, with a dynamic radius computed per actuator from its current position, makes the reachable action set exactly match the heterogeneous box constraint in action space. This yields hard constraint satisfaction with probability 1, well-conditioned gradients, and exact policy gradient computation with no solver cost.
What carries the argument
Dynamic Decoupled Spherical Radial Squashing (DD-SRad), the mechanism that calculates a unique position-adaptive radius for each actuator to squash its action increment independently.
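To make the mechanism concrete, here is a minimal sketch of the squashing map in Python, reconstructed from the parameterization and effective-radius cases quoted later on this page; the function and variable names are ours, and the paper's exact formulation may differ.

```python
import numpy as np

def dd_srad_step(a_prev, u, a_min, a_max, delta):
    """Sketch of one DD-SRad action step (our reconstruction, not the
    paper's code). Each actuator gets its own position-adaptive radius.

    a_prev        : current joint positions, shape (d,)
    u             : raw (unsquashed) policy output, shape (d,)
    a_min, a_max  : per-joint position limits from the datasheet
    delta         : per-joint rate limits, |Δa_i| ≤ δ_i per control step
    """
    # Effective radius per actuator: the rate limit, shrunk by the
    # remaining headroom toward whichever position limit u_i points at.
    headroom = np.where(u >= 0, a_max - a_prev, a_prev - a_min)
    r_eff = np.minimum(delta, headroom)

    # Componentwise radial squashing: |u_i / sqrt(1 + u_i^2)| < 1, so
    # |Δa_i| < R_eff^i ≤ δ_i and a_i stays inside [a_min^i, a_max^i].
    return a_prev + r_eff * u / np.sqrt(1.0 + u**2)
```

Because every quantity above is per-component, the reachable increments form a product of intervals, i.e. a box, rather than a shared ball.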
If this is right
- Training requires no additional constraint solvers at runtime.
- Highest task returns are achieved while maintaining zero constraint violations.
- Feasible-set coverage improves by 30 to 50 percent over prior spherical approaches.
- Policies can be derived directly from official robot joint specifications.
Where Pith is reading between the lines
- The method could apply to other state-dependent or heterogeneous constraints in control tasks.
- Testing in higher-dimensional systems might reveal whether independence holds without additional coupling terms.
- It provides a direct mapping from hardware data to safe RL deployment without manual tuning.
Load-bearing premise
The true high-dimensional box constraint on action increments can be fully represented by applying independent radial squashing adjustments to each actuator without missing inter-joint effects.
What would settle it
A demonstration of constraint violations during training or inference on a system with strongly coupled actuator dynamics, or no coverage gain in high-dimensional action spaces.
Original abstract
When deploying reinforcement learning policies to physical robots, actuator rate constraints (hard limits on how fast each joint can move per control step) are unavoidable. These limits vary substantially across joints due to differences in motor inertia, power bandwidth, and transmission stiffness, creating pronounced heterogeneity that existing methods fail to handle geometrically: the per-joint feasible region forms a high-dimensional box in action-increment space, yet QP projection and spherical parameterization methods impose isotropic ball-shaped constraints, exponentially under-covering the true feasible set as heterogeneity grows. This paper proposes Dynamic Decoupled Spherical Radial Squashing (DD-SRad), which resolves this mismatch by computing a position-adaptive radius independently for each actuator, achieving tight alignment with the true per-joint feasible region. DD-SRad satisfies per-step hard constraints with probability 1, preserves well-conditioned gradients throughout training, and admits exact policy gradient backpropagation with zero runtime solver overhead. MuJoCo benchmark experiments demonstrate the highest task return at zero constraint violation (matching the unconstrained upper bound) with 30–50% improvement in constraint-space coverage over spherical baselines. High-fidelity IsaacLab simulations with Unitree H1 and G1 humanoid robots confirm end-to-end optimality parameterized directly from official joint specifications, validating a systematic pathway from hardware datasheets to safe deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Dynamic Decoupled Spherical Radial Squashing (DD-SRad) for enforcing heterogeneous actuator rate constraints in RL policies. It computes independent position-adaptive radii per actuator to align the feasible set in action-increment space with the true high-dimensional box, claiming per-step hard constraint satisfaction with probability 1, well-conditioned gradients, and exact policy-gradient backpropagation with no runtime solver. MuJoCo benchmarks and IsaacLab humanoid simulations (Unitree H1/G1) report highest task returns at zero violations, matching unconstrained performance, plus 30–50% better constraint-space coverage than spherical baselines, with direct parameterization from hardware joint specs.
Significance. If the geometric equivalence and gradient claims hold, DD-SRad would offer a practical, overhead-free method for safe RL on robots with heterogeneous joints, directly linking datasheet limits to policy training while preserving optimality. The reported ability to match unconstrained returns at zero violation is a strong empirical result; the coverage gains and end-to-end validation on full humanoids add practical value over isotropic or QP-based alternatives.
Major comments (3)
- [§3] §3 (DD-SRad construction): the claim that independent per-actuator dynamic radii produce a feasible set coinciding exactly with the heterogeneous box requires an explicit proof or derivation showing no coupling from shared normalization, direction vectors, or radial scaling. Any such coupling would shrink the reachable set below the box (e.g., to an inscribed ellipsoid), falsifying both the probability-1 hard-constraint guarantee and the reported coverage improvement.
- [§4] §4 (gradient and backpropagation): the state-dependent radii make the squashing a function of the current state; the manuscript must demonstrate that differentiation through this dependence yields exact policy gradients without approximation. Failure here would invalidate the 'exact backpropagation' and 'well-conditioned gradients' claims even if the forward map is feasible.
- [Experiments] Experimental results (MuJoCo and IsaacLab sections): the 30–50% constraint-space coverage gain and 'zero violation' results need a precise definition of the coverage metric, plus reporting of variance and statistical tests across seeds to support superiority over spherical baselines.
Minor comments (2)
- [§3] Notation for the radial squashing function should be introduced with a clear equation number and distinguished from prior spherical parameterization methods.
- [Abstract] The abstract's 'probability 1' phrasing should be replaced by a precise statement (e.g., 'almost surely' or 'with probability 1 under the policy distribution') once the supporting lemma is stated.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify areas where additional rigor will strengthen the manuscript. We address each major comment below and will incorporate the requested clarifications and supporting material in the revision.
Point-by-point responses
Referee: [§3] §3 (DD-SRad construction): the claim that independent per-actuator dynamic radii produce a feasible set coinciding exactly with the heterogeneous box requires an explicit proof or derivation showing no coupling from shared normalization, direction vectors, or radial scaling. Any such coupling would shrink the reachable set below the box (e.g., to an inscribed ellipsoid), falsifying both the probability-1 hard-constraint guarantee and the reported coverage improvement.
Authors: We agree that an explicit derivation is necessary to confirm the absence of unintended coupling. The DD-SRad construction computes each actuator's radius independently from its own position and hardware limits, using per-component normalization and unit direction vectors with no shared scaling across dimensions. In the revised manuscript we will insert a formal proof in §3 demonstrating that the resulting feasible set is exactly the heterogeneous box: the component-wise squashing maps the unit ball in normalized space back to the original box boundaries without shrinkage or cross terms. This will directly support the probability-1 guarantee and the coverage claims. revision: yes
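A quick numeric illustration of the "no cross terms" point, using the dd_srad_step sketch from above (our reconstruction): pushing all raw actions to large magnitude drives every joint toward its own limit at once, reaching a box corner that no isotropic ball of radius min_i δ_i could contain.

```python
import numpy as np

a_prev = np.array([0.0, 0.0])
a_min, a_max = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
delta = np.array([0.3, 0.05])

# Large raw actions saturate each component independently.
da = dd_srad_step(a_prev, np.array([1e6, 1e6]), a_min, a_max, delta) - a_prev
print(da)                  # ≈ [0.3, 0.05], the box corner (δ_1, δ_2)
print(np.linalg.norm(da))  # ≈ 0.304 > min δ_i = 0.05: outside any isotropic ball
```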
Referee: [§4] §4 (gradient and backpropagation): the state-dependent radii make the squashing a function of the current state; the manuscript must demonstrate that differentiation through this dependence yields exact policy gradients without approximation. Failure here would invalidate the 'exact backpropagation' and 'well-conditioned gradients' claims even if the forward map is feasible.
Authors: We acknowledge that the state dependence of the radii must be handled explicitly in the gradient derivation. Because the radii are smooth functions of the observed state and the squashing operation is applied after the policy output, the chain rule yields an exact Jacobian that includes the partial derivatives with respect to the radii. In the revised §4 we will provide the full analytic differentiation steps, showing that the policy-gradient estimator remains exact (no Monte-Carlo approximation of the radius gradient) and that the resulting gradients remain well-conditioned. This will be accompanied by a short numerical verification that the analytic gradient matches finite-difference checks. revision: yes
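The finite-difference verification the authors promise could look like the following sketch (ours, reusing dd_srad_step from above). Away from u_i = 0, R_eff^i is locally constant in u_i, so the analytic diagonal Jacobian of the map with respect to u is R_eff^i / (1 + u_i²)^(3/2). Note this checks only the action-input Jacobian; the state dependence of R_eff adds further terms that the revised derivation would cover.

```python
import numpy as np

def analytic_jac_diag(a_prev, u, a_min, a_max, delta):
    """Diagonal of d(a)/d(u) for the sketched map, valid away from the
    sign switch at u_i = 0 where R_eff^i is locally constant in u_i."""
    headroom = np.where(u >= 0, a_max - a_prev, a_prev - a_min)
    r_eff = np.minimum(delta, headroom)
    return r_eff / (1.0 + u**2) ** 1.5

a_min, a_max = np.array([-1.0, -0.5]), np.array([1.0, 1.5])
delta = np.array([0.3, 0.05])
a_prev = np.array([0.2, 1.0])
u = np.array([0.7, -1.3])          # one positive, one negative branch

eps = 1e-6                         # central finite differences
fd = (dd_srad_step(a_prev, u + eps, a_min, a_max, delta)
      - dd_srad_step(a_prev, u - eps, a_min, a_max, delta)) / (2 * eps)
assert np.allclose(fd, analytic_jac_diag(a_prev, u, a_min, a_max, delta))
```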
Referee: [Experiments] Experimental results (MuJoCo and IsaacLab sections): the 30–50% constraint-space coverage gain and 'zero violation' results need a precise definition of the coverage metric, plus reporting of variance and statistical tests across seeds to support superiority over spherical baselines.
Authors: We agree that the coverage metric and statistical reporting require clarification. In the revised experimental sections we will (i) give the exact definition of coverage (volume ratio of the DD-SRad feasible set to the true heterogeneous box, computed via Monte-Carlo sampling in normalized space), (ii) report mean and standard deviation of all metrics across at least five independent random seeds, and (iii) include paired t-tests or Wilcoxon signed-rank tests with p-values to establish statistical significance of the reported gains over spherical baselines. The zero-violation claim will be restated with the precise per-step violation probability observed across all evaluation episodes. revision: yes
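A minimal sketch of the coverage computation as the response describes it: Monte-Carlo sampling in the box of rate-feasible increments {Δa : |Δa_i| ≤ δ_i}. The DD-SRad image covers this box by construction, so the interesting number is how little of it an isotropic ball of radius min_i δ_i covers; the function name and normalization are our assumptions.

```python
import numpy as np

def isotropic_coverage(delta, n_samples=100_000, seed=0):
    """Monte-Carlo fraction of the box {Δa : |Δa_i| ≤ δ_i} covered by
    the isotropic ball of radius min_i δ_i (the spherical baseline).
    Under this normalization the DD-SRad feasible set scores 1.0."""
    rng = np.random.default_rng(seed)
    samples = rng.uniform(-delta, delta, size=(n_samples, len(delta)))
    return float(np.mean(np.linalg.norm(samples, axis=1) <= delta.min()))

# Coverage collapses as per-joint heterogeneity and dimension grow.
print(isotropic_coverage(np.array([0.3, 0.05])))            # 2 joints
print(isotropic_coverage(np.array([0.3, 0.2, 0.1, 0.05])))  # 4 joints
```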
Circularity Check
No circularity: DD-SRad is a direct geometric construction validated on benchmarks
Full rationale
The paper defines DD-SRad explicitly as a new per-actuator, position-adaptive radial squashing operation that aligns the feasible set with the heterogeneous box constraint in action-increment space. No step in the provided abstract or description reduces a claimed prediction or property to a fitted input, self-citation, or redefinition; the hard-constraint guarantee, gradient conditioning, and exact backpropagation are presented as direct consequences of the decoupled construction itself. Experimental results on MuJoCo and IsaacLab are reported as empirical validation of the method rather than as inputs to the derivation. The approach therefore remains self-contained against external benchmarks with no load-bearing circular reductions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost (Jcost) · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Quoted passage: a = c_s + R·u/√(1 + ‖u‖²) … a_i = a_prev^i + R_eff^i(u_i, a_prev^i)·u_i/√(1 + u_i²)
- IndisputableMonolith/Foundation/AlexanderDuality (D=3 forcing) · alexander_duality_circle_linking · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Quoted passage: V_DD / V_SRad = 2^d Γ(d/2 + 1)/π^(d/2) · ∏_i δ_i / (min_i δ_i)^d (Theorem 2.7).
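Read as box volume over inscribed-ball volume, the quoted ratio follows from V_box = ∏_i 2δ_i and V_ball = π^(d/2)/Γ(d/2+1) · (min_i δ_i)^d; a quick numeric check of that identity, under our reading:

```python
import math

def quoted_ratio(delta):
    """The ratio as quoted: 2^d Γ(d/2+1)/π^(d/2) · Π δ_i / (min δ_i)^d."""
    d = len(delta)
    return (2**d * math.gamma(d / 2 + 1) / math.pi ** (d / 2)
            * math.prod(delta) / min(delta) ** d)

def box_over_ball(delta):
    """Box volume Π 2δ_i over the volume of the ball of radius min δ_i."""
    d = len(delta)
    box = math.prod(2 * x for x in delta)
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * min(delta) ** d
    return box / ball

assert math.isclose(quoted_ratio([0.3, 0.2, 0.1, 0.05]),
                    box_over_ball([0.3, 0.2, 0.1, 0.05]))
```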
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Quoted passages from the paper
- Rate constraint. |Δa_i| = R_eff^i·f(u_i) < R_eff^i ≤ δ_i, where the strict inequality uses f(u_i) < 1. (When a_prev^i = a_max^i one has R_eff^i = 0, giving Δa_i = 0, which also satisfies |Δa_i| ≤ δ_i.)
- Position constraint. Since Δa_i = R_eff^i·f(u_i) ≥ 0, it follows that a_i = a_prev^i + Δa_i ≥ a_prev^i ≥ a_min^i. Moreover, Δa_i < R_eff^i ≤ a_max^i − a_prev^i, so a_i < a_max^i. Hence a_i ∈ [a_min^i, a_max^i]. If additionally a_prev^i ∈ (a_min^i, a_max^i), then R_eff^i > 0, so |Δa_i| < δ_i and a_i ∈ (a_min^i, a_max^i) hold strictly. Case 2: u_i < 0. Then R_eff^i = min{δ_i, a_prev^i − …}
- Gradient bound. Differentiating with respect to u and applying the triangle inequality: ‖∇_u L_actor‖₂ = ‖∇_a Q(s̃, a)|_{a=φ(u)}·J(u) + 2λ_base·u‖₂ ≤ ‖J(u)‖₂·‖∇_a Q(s̃, a)‖₂ + 2λ_base·‖u‖₂. Substituting the spectral norm bound from (ii) yields inequality (9). For the SAC backbone, the entropy term α·log π_θ(a|s̃) contributes an additional gradient term α·∇_u log π_θ(a|s̃), which enters li…
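Both quoted guarantees are straightforward to spot-check numerically against the dd_srad_step sketch from earlier on this page (our reconstruction, not the paper's implementation):

```python
import numpy as np

# Property check of the quoted rate and position constraints.
rng = np.random.default_rng(1)
a_min, a_max = np.array([-1.0, -0.5, 0.0]), np.array([1.0, 1.5, 2.0])
delta = np.array([0.3, 0.05, 0.2])

for _ in range(10_000):
    a_prev = rng.uniform(a_min, a_max)
    u = rng.normal(scale=5.0, size=3)             # wild raw actions
    a = dd_srad_step(a_prev, u, a_min, a_max, delta)
    assert np.all(np.abs(a - a_prev) <= delta)    # rate constraint
    assert np.all((a >= a_min) & (a <= a_max))    # position constraint
```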