pith. machine review for the scientific record.

arxiv: 2605.11473 · v1 · submitted 2026-05-12 · 💻 cs.AI · cs.LG · cs.RO · stat.ML

Recognition: 2 theorem links


TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:43 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.RO · stat.ML
keywords multi-task reinforcement learning · proximal policy optimization · critic balancing · gradient ill-conditioning · Meta-World benchmark · on-policy methods · tail task performance · soft actor-critic

The pith

By fixing critic gradient ill-conditioning, TOPPO lets PPO match or beat SAC in multi-task RL while using fewer parameters and steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that PPO in multi-task reinforcement learning fails because the critic's gradients become ill-conditioned, letting easy tasks dominate updates and stalling harder ones. TOPPO adds Critic Balancing modules to restore proper conditioning and equalize learning across tasks without larger models or modular architectures. If this holds, on-policy methods can rival the dominant off-policy approaches in MTRL while consuming less data and computation. The work targets the optimization process inside PPO itself rather than changing the overall learning paradigm.

Core claim

TOPPO reformulates PPO through Critic Balancing modules that correct critic-side gradient ill-conditioning in multi-task reinforcement learning. This produces stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using substantially fewer parameters and environment steps on the Meta-World+ benchmark. The method matches or surpasses strong SAC baselines early in training and sustains the advantage at full budget, showing that on-policy methods can compete when the critic optimization bottleneck is addressed directly.

What carries the argument

Critic Balancing modules, a set of components that improve gradient conditioning and balance learning dynamics across tasks inside the PPO update.
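This page surfaces no equations for these modules, so any concrete rendering is a guess. As a minimal PyTorch sketch of the general idea, assuming the critic loss decomposes per task and that balancing amounts to equalizing per-task gradient magnitudes before aggregation (the function name, `task_batches`, and the rescaling rule are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def balanced_critic_step(critic, optimizer, task_batches, eps=1e-8):
    """Sketch of a shared-critic update with per-task gradient balancing.

    `task_batches` is a list of (obs, value_target) pairs, one per task.
    The paper's actual Critic Balancing modules are not specified here;
    this only illustrates the idea of equalizing per-task gradient
    magnitudes so easy tasks cannot dominate the shared update direction.
    """
    params = [p for p in critic.parameters() if p.requires_grad]
    flat_grads = []
    for obs, target in task_batches:
        loss = F.mse_loss(critic(obs).squeeze(-1), target)
        grads = torch.autograd.grad(loss, params)
        flat_grads.append(torch.cat([g.flatten() for g in grads]))

    # Rescale each task's gradient to the mean norm before averaging.
    norms = torch.stack([g.norm() for g in flat_grads])
    target_norm = norms.mean()
    combined = torch.stack(
        [g * (target_norm / (n + eps)) for g, n in zip(flat_grads, norms)]
    ).mean(dim=0)

    # Write the balanced direction into .grad and take one optimizer step.
    optimizer.zero_grad()
    offset = 0
    for p in params:
        p.grad = combined[offset:offset + p.numel()].view_as(p).clone()
        offset += p.numel()
    optimizer.step()
```

The point of the sketch is the failure mode it removes: under a plain mean over per-task losses, a task whose value targets are large or easy to fit contributes a proportionally larger gradient and steers the shared critic.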

If this is right

  • Mean performance across tasks rises because no individual task stalls the shared critic.
  • Competitive results appear with fewer environment steps than required by SAC or ARS baselines.
  • Smaller policy and critic networks suffice once gradient conditioning is restored.
  • On-policy PPO can equal or exceed off-policy SAC performance once the critic update is balanced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same balancing idea could be tested on other on-policy algorithms that share critics across tasks.
  • In data-limited settings the focus may shift from off-policy replay to fixing on-policy gradient flow.
  • Gradient conditioning problems may appear in single-task PPO when task difficulty varies internally.

Load-bearing premise

That critic-side gradient ill-conditioning is the main cause of tail-task stalling in standard PPO for multi-task RL, and that the balancing modules fix it without new side effects on stability or on other tasks.
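The premise is directly measurable. A minimal diagnostic sketch, mirroring the quantities plotted in the paper's Figures 2 and 3 (per-task critic gradient norms, their spread, and pairwise cosines); the function name and return format are illustrative:

```python
import torch

def critic_gradient_diagnostics(per_task_grads, eps=1e-12):
    """Conditioning diagnostics over flattened per-task critic gradients,
    one 1-D tensor per task. Names and output format are illustrative.
    """
    G = torch.stack(per_task_grads)              # (K, P)
    norms = G.norm(dim=1)                        # per-task gradient norms
    # Magnitude-spread ratio: the paper reports a 497x per-task
    # critic gradient-norm spread under MT-PPO on this quantity.
    spread = (norms.max() / norms.min().clamp_min(eps)).item()
    # Normalized Gram matrix: off-diagonals are cos angle(g_i, g_j).
    Gn = G / norms.clamp_min(eps).unsqueeze(1)
    return {"norms": norms, "spread_ratio": spread, "gram_cosines": Gn @ Gn.T}
```

A large spread ratio alongside benign cosines would point at magnitude imbalance rather than directional conflict, which is the distinction the premise turns on.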

What would settle it

A disconfirming experiment on Meta-World+ in which adding the Critic Balancing modules produces no gain in tail-task success rates or overall returns over plain PPO under otherwise identical conditions.
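Concretely, that is a matched-seed ablation. A skeleton of the protocol, with `train_fn` as a hypothetical runner returning final per-task success rates:

```python
import numpy as np

def settling_experiment(train_fn, seeds=range(10), k_tail=10):
    """Matched-condition A/B test: plain PPO vs. PPO + Critic Balancing
    on Meta-World+, identical seeds, budget, and evaluation protocol.
    `train_fn(variant, seed)` is hypothetical and should return a
    (num_tasks,) array of final success rates.
    """
    out = {}
    for variant in ("ppo_plain", "ppo_critic_balancing"):
        success = np.stack([train_fn(variant, seed=s) for s in seeds])
        tail = np.sort(success, axis=1)[:, :k_tail]   # worst-k tasks per seed
        out[variant] = {"mean": success.mean(), "tail": tail.mean()}
    return out  # no tail gap between variants would undercut the core claim
```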

Figures

Figures reproduced from arXiv: 2605.11473 by Annie Qu, Gefei Lin, Rui Miao, Yuanpeng Li.

Figure 1. TOPPO (717K params) on Meta-World V2 MT50 matches the ARS family at … [figure omitted]

Figure 2. Per-task gradient norm for critic vs. actor in MT10. [figure omitted]

Figure 3. Early-stage (~0.5M steps) MT10 Gram matrices of critic gradients. Off-diagonal entries are cos∠(g_i^c, g_j^c); diagonal entries are log10 G_ii^c. Histograms: ‖g_i^c‖/K and the magnitude-spread ratio. See Figure D.1 for mid/late stages. To address unfair aggregation (D3), we replace the critic mean update with FairGrad aggregation, d^c = Σ_i w_i g_i^c, where w ∈ ℝ^K_{>0} is chosen by an α-fairness objective over task con… [figure omitted]

Figure 4. MT50 ablation ladder. (a) Final mean. (b) Final worst. [figure omitted]
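Figure 3's caption carries the one equation this page surfaces: the critic mean update is replaced by FairGrad aggregation, d^c = Σ_i w_i g_i^c with w ∈ ℝ^K_{>0} chosen by an α-fairness objective. The paper's Newton solver (with fixed constants ymax, η, max iterations, per its appendix) is not reproduced here; a minimal fixed-point sketch under the standard α-fair stationarity condition w_i ∝ (g_i^⊤ d)^(−α), treating task i's contribution as c_i = g_i^⊤ d:

```python
import torch

def fairgrad_aggregate(per_task_grads, alpha=2.0, iters=20, eps=1e-8):
    """Sketch of FairGrad-style aggregation d = sum_i w_i g_i, w > 0.

    Fixed-point heuristic for w_i proportional to (g_i^T d)^(-alpha);
    the paper's Newton solver is not reproduced, and the clamp below
    is a crude guard against negative task contributions.
    """
    G = torch.stack(per_task_grads)          # (K, P): one flat gradient per task
    w = torch.full((G.shape[0],), 1.0 / G.shape[0])
    for _ in range(iters):
        d = w @ G                            # candidate shared direction
        contrib = (G @ d).clamp_min(eps)     # c_i = g_i^T d, kept positive
        w = contrib.pow(-alpha)
        w = w / w.sum()                      # renormalize, keep w > 0
    return w @ G
```

Larger α shifts weight toward tasks whose gradients contribute least to the shared direction; as α grows the rule approaches max-min fairness over task contributions.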
read the original abstract

Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously overlooked issue: critic-side gradient ill-conditioning, which may cause tail tasks to stall while easy tasks dominate the value function's updates. To address this, we propose TOPPO (Tail-Optimized PPO), a reformulation of PPO via Critic Balancing -- a set of modules that improve gradient conditioning and balance learning dynamics across tasks. Unlike prior approaches that rely on modular architectures or large models, TOPPO targets the optimization bottleneck within PPO itself. Empirically, TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using substantially fewer parameters and environment steps on the Meta-World+ benchmark. Notably, TOPPO matches or surpasses strong SAC baselines early in training and maintains superior performance at full budget. Ablations confirm the effectiveness of each module in TOPPO and provide insights into their interactions. Our results demonstrate that, with proper optimization, on-policy methods can rival or exceed off-policy approaches in MTRL, challenging the prevailing reliance on SAC and highlighting critic-side gradient conditioning as the central bottleneck.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that PPO in multi-task RL suffers from an overlooked critic-side gradient ill-conditioning issue that stalls tail tasks while easy tasks dominate, and introduces TOPPO via Critic Balancing modules to improve conditioning and balance dynamics. It reports that TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines on Meta-World+ while using substantially fewer parameters and environment steps, with ablations confirming module effectiveness and early-training advantages.

Significance. If the results and causal mechanism hold, this would be significant for MTRL: it would demonstrate that targeted optimization fixes within on-policy PPO can rival or exceed off-policy SAC approaches without modular architectures or large models, with the efficiency claims (fewer parameters and environment steps) and the ablations' insights into module interactions adding to that case. It challenges the prevailing SAC dominance by highlighting critic conditioning as a central bottleneck.

major comments (2)
  1. [Abstract] Abstract: the diagnosis of 'critic-side gradient ill-conditioning' as the primary cause of tail-task stalling is presented without any reported measurements (e.g., gradient condition numbers, per-task critic Jacobian norms, or Hessian spectra) showing worse conditioning in standard PPO versus TOPPO; absent this, performance deltas could arise from incidental effects such as altered step sizes or task weighting rather than the posited mechanism.
  2. [Method] Method section (implied by abstract description of reformulation): no equations, derivations, or pseudocode are supplied in the abstract to define how the Critic Balancing modules alter gradient flow or value updates, making it impossible to verify the claim that they directly improve conditioning without side effects on stability or other tasks.
minor comments (2)
  1. [Abstract] Abstract: specify the exact number of runs, random seeds, and whether error bars or statistical tests (e.g., t-tests) support the mean/tail superiority claims over baselines.
  2. [Experiments] Experiments: define or cite 'Meta-World+' clearly if it is a modified version of the standard benchmark, and ensure all tables/figures report parameter counts and environment steps explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract's presentation of our core claims and mechanism. We address each major comment below, providing clarifications and indicating revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the diagnosis of 'critic-side gradient ill-conditioning' as the primary cause of tail-task stalling is presented without any reported measurements (e.g., gradient condition numbers, per-task critic Jacobian norms, or Hessian spectra) showing worse conditioning in standard PPO versus TOPPO; absent this, performance deltas could arise from incidental effects such as altered step sizes or task weighting rather than the posited mechanism.

    Authors: We agree that explicit measurements of gradient conditioning would more directly substantiate the proposed mechanism over alternative explanations. The full manuscript already contains supporting evidence via ablation studies, per-task performance curves, and early-training dynamics that are consistent with improved critic conditioning under TOPPO. In the revision we will add a new subsection reporting gradient condition numbers and per-task critic Jacobian norms computed on Meta-World+ for both baseline PPO and TOPPO, confirming the conditioning improvement. We will also insert a brief clause in the abstract directing readers to this analysis. revision: yes

  2. Referee: [Method] Method section (implied by abstract description of reformulation): no equations, derivations, or pseudocode are supplied in the abstract to define how the Critic Balancing modules alter gradient flow or value updates, making it impossible to verify the claim that they directly improve conditioning without side effects on stability or other tasks.

    Authors: Abstracts are intentionally concise and do not contain equations or pseudocode; the complete mathematical formulation, derivations, and algorithm pseudocode for the Critic Balancing modules appear in Section 3 of the manuscript, where we explicitly show the modified critic loss, the balancing regularizers, and their effect on the gradient with respect to the shared value function. To address the concern, we have added one sentence to the abstract that points readers to the detailed equations and stability analysis in the method section. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmark validation

full rationale

The paper proposes TOPPO as an empirical reformulation of PPO using Critic Balancing modules to address a diagnosed optimization issue in MTRL. No derivation chain, equations, or first-principles results are presented that reduce claimed performance gains to inputs defined by the method itself (e.g., no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations). Ablations and Meta-World+ comparisons provide independent empirical content against external baselines, making the central claims falsifiable outside the paper's own constructs. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces Critic Balancing as a set of modules but does not enumerate free parameters, background axioms, or newly postulated entities; evaluation is limited by the absence of the full text.

pith-pipeline@v0.9.0 · 5548 in / 1194 out tokens · 37809 ms · 2026-05-13T01:43:42.351249+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

  1. [1]

    Andrychowicz, A

    arXiv:2006.05990. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

  2. [2]

    arXiv:2402.15638. Tony F. Chan, Gene H. Golub, and Randall J. LeVeque. Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician, 37(3):242–247,

  3. [3]

    Restricted Admissible Limit for Domains of Finite Type

    doi: 10.1080/00031305.1983.10483115. Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 794–803. PMLR,

  4. [4]

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine

    arXiv:2005.12729. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870. PMLR,

  5. [5]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    arXiv:2310.16828. Ahmed Hendawy, Jan Peters, and Carlo D’Eramo. Multi-task reinforcement learning with mixture of orthogonal experts. In International Conference on Learning Representations,

  6. [6]

    Hendawy, J

    arXiv:2311.11385. Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3796–3803,

  7. [7]

    Viraj Joshi, Zifan Xu, Bo Liu, Peter Stone, and Amy Zhang

    arXiv:1809.04474. Viraj Joshi, Zifan Xu, Bo Liu, Peter Stone, and Amy Zhang. Benchmarking massively parallelized multi-task reinforcement learning for robotics tasks. Reinforcement Learning Journal,

  8. [8]

    arXiv:2507.23172; presented at the Reinforcement Learning Conference (RLC)

  9. [9]

    Simba: Simplicity bias for scaling up parameters in deep reinforcement learning. arXiv preprint arXiv:2410.09754, 2024

    arXiv:2410.09754. Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. In Advances in Neural Information Processing Systems, volume 34,

  10. [10]

    Shikun Liu, Edward Johns, and Andrew J

    arXiv:2306.03792. Shikun Liu, Edward Johns, and Andrew J. Davison. End-to-end multi-task learning with attention. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1871–1880,

  11. [11]

    Meta-World+: An Improved, Standardized,

    arXiv:2505.11289. Jeonghoon Mo and Jean Walrand. Fair end-to-end window-based congestion control. IEEE/ACM Transactions on Networking, 8(5):556–567,

  12. [12]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    arXiv:1506.02438. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  13. [13]

    Lingfeng Sun, Haichao Zhang, Wei Xu, and Masayoshi Tomizuka

    arXiv:2102.06177. Lingfeng Sun, Haichao Zhang, Wei Xu, and Masayoshi Tomizuka. PaCo: Parameter-compositional multi-task reinforcement learning. In Advances in Neural Information Processing Systems, volume 35,

  14. [14]

    arXiv:2210.11653. Hado P. van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, volume 29,

  15. [15]

    arXiv:1602.07714. B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420,

  16. [16]

    Ruihan Yang, Huazhe Xu, Yi Wu, and Xiaolong Wang

    doi: 10.1080/00401706.1962.10490022. Ruihan Yang, Huazhe Xu, Yi Wu, and Xiaolong Wang. Multi-task reinforcement learning with soft modularization. In Advances in Neural Information Processing Systems, volume 33,

  17. [17]

    Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn

    arXiv:2003.13661. Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In Advances in Neural Information Processing Systems, volume 33, 2020a. Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evalua...

  18. [18]

    argue that across supervised MTL benchmarks and Meta-World multi-task SAC, the simple uniform sum of losses (unitary scalarization) matches specialized multi-task optimizers once standard regularization is matched; our diagnostic evidence (Figure 2: critic per-task gradient-norm spread 497×) is on MT-PPO specifically, a regime they do not test, where loss...

  19. [19]

    success rate

    We retain PPO’s value clip and apply a post-aggregation gradient-norm clip uniformly across all configurations. Full numerical hyperparameter values are in Table B.1; PPO-specific adaptations made when porting SAC-based gradient-surgery methods (PCGrad, CAGrad, FairGrad) to PPO are documented in Section B.4. Evaluation protocol. All numbers reported in th...

  20. [20]

    The full 4-metric breakdown of every reported configuration is in Table D.10, cross-paper convention conversions are in Section D.7, and per-task breakdowns are in Section D.6

    and IQM ± std against the Meta-World+ V1 baselines (Table 3); for our rows, both views are computed from the same 10-seed run set. The full 4-metric breakdown of every reported configuration is in Table D.10, cross-paper convention conversions are in Section D.7, and per-task breakdowns are in Section D.6. Compute resources.TOPPO is trained on a single NV...

  21. [21]

    parameter count

    The three FairGrad solver entries below the per-side α rows (ymax, η, max Newton iters) are fixed solver constants, not swept.
    Parameter · MT10 · MT50
    Network architecture (fully task-shared except per-task σ):
    actor θ^a hidden sizes · [400,400,400] · [400,400,400]
    actor output projection · shared Linear(400,|A|) for µ · shared Linear(400,|A|) for µ
    critic θ^c hidden sizes · [400,4...

  22. [22]

    The actor/critic partition is exposed to the combiner by a parameter-filter helper that slices the per-task gradient list to each side’s flat slab before invoking it

    Per-task σ_i rows are decoupled from surgery. The per-task log-stddev σ_i ∈ ℝ^|A| receives gradient only from task i, so any per-task reweighting acts on row i as a scalar rescale and never aggregates across tasks; only the fully task-shared parameters θ^a, θ^c feed the per-side combiner. The actor/critic partition is exposed to the combiner by a parameter-filt...

  23. [23]

    per-task contribution 18 xi

    with the bookkeeping that the main-text version folds into surrounding loop headers, mirroring the released implementation. The differences from vanilla MT-PPO are localized to four points: PopArt update before the minibatch loop (line 3), task-stratified minibatching (line 6), per-task loss decomposition with one backward per task (lines 9–12), and the ...

  24. [24]

    E.3), even when the headline V2 FG-c vs

    to matter more in environments with more extreme per-task return-scale heterogeneity than V2 (McLean et al., 2025, App. E.3), even when the headline V2 FG-c vs. PCGrad-c gap is within seed noise; both combiners select different points in the same shared cone_+{g_i^c} (Theorem C.1, Section C.3). Table D.2: Per-side combiner asymmetry on MT50 (10 seeds, 10...

  25. [25]

    D.5 V1 reward sensitivity. Meta-World V1 rewards have substantially wider per-task scale heterogeneity than V2 [McLean et al., 2025], exactly the regime in which Corollary 1’s scope language predicts PopArt’s non-rescaling channels (Welford running statistics, Adam moments, downstream actor effects through Â_i) become individually load-bearing. We re-ran t...

  26. [26]

    Meta-World+ baselines are reported in Section 5.3 and Tables 3 and D.3

    The Corollary 1 signature inversion (PopArt-marginal sign flip across V1/V2) and the V1 head-to-head numbers vs. Meta-World+ baselines are reported in Section 5.3 and Tables 3 and D.3. Two appendix-only mechanism points add to that. Why FG-c without PopArt hurts on V1. Without PopArt’s value-target normalization, V1’s wide per-task return scales push some...

  27. [27]

    Appendix D. V1’s wider per-task return-scale heterogeneity flips the Corollary 1 signature: PopArt’s marginal over LN-c+FG-c grows to 17.01%↑ (rows 4→5) versus 0.32%↑ on V2, the regime in which PopArt’s non-rescaling channels become load-bearing rather than absorbed at the FG-c aggregator, and TOPPO clears every V1 SAC baseline by ≥17.6pp under matched-me...

  28. [28]

    per-task

    [axis ticks omitted; y-axis: mean success rate over worst-k tasks (%)] (b) MT50 V1 worst-k tail (ours only): k = 10 tail: 4.9%; k = 20 tail: 49.2%; k = 50 (full mean): 79.7%. Meta-World V1 rewards; 100M total environment steps. Ours (n = 10 seeds): [400, 400, 400] shared MLPs, 717K total parameters. (a) right-margin markers: Meta-World+ V1 MT50 IQM at full budget (MTMHSAC...

  29. [29]

    Reporting all four lets readers convert our numbers to any baseline’s protocol (Table D.9)

    …10 (E), max-over-time per (seed, task), none, 5, V1
    Soft-Mod (2020): 3 (E), final-policy 100-ep dedicated pass, (graphical only), 100, V1
    PCGrad (2020): ?, figure eyeball (no numeric table), –, ?, V1
    CAGrad (2021): 10 (C), mean-curve peak (App B.3), SEM, 1, V1*
    CARE (2021): 10 (C), mean-curve peak, SEM, 5, V1
    PaCo (2022): 10 (A), final 20M (explicit anti-peak), std, 10, V2
    FAMO (2023): 10 (C)...