arxiv: 2605.09109 · v1 · submitted 2026-05-09 · 💻 cs.AI

When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning

Pith reviewed 2026-05-12 02:32 UTC · model grok-4.3

keywords: expert-guided reinforcement learning · query-time methods · failure modes · decision rule · continuous control · suboptimal experts · reinforcement learning

The pith

Expert-guided RL methods each succeed only in specific regimes of expert quality and task structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares several query-time methods that use a suboptimal expert controller to guide reinforcement learning in continuous control tasks. By running all methods on the same SAC backbone with shared hyperparameter tuning and a sweep of expert degradations, it identifies three failure modes missed by isolated tests. These modes are a critic blind spot on experts near the RL ceiling, residual saturation on poor experts, and buffer poisoning in handoff approaches. No method wins in every setting, and none beats RL-near-ceiling experts within the 1M-step budget on the tested tasks. The comparison is turned into a practical decision rule that uses three observables available before training to pick the right method for a given expert and task.

Core claim

Harmonized benchmarking of query-time expert-guided RL methods on a shared backbone, with degradation sweeps over undertuning, action bias, and observation noise, shows that each method has regime-specific failure modes: no single method dominates across all cases, and no method clears RL-near-ceiling experts within a 1M-step budget. The authors convert this spread into a testable decision rule keyed on expert quality, task termination, and perturbation type.

What carries the argument

The taxonomy of three failure modes (critic blind spot under argmax-plus-bootstrap, residual saturation, and warm-start buffer poisoning) together with the decision rule based on three pre-training observables that selects the appropriate guidance method.
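As a rough illustration of how the three observables could be turned into failure-mode warnings, here is a minimal sketch. It is not the paper's rule: the cut-off values, the `Observables` type, and the `flag_failure_modes` function are hypothetical names introduced here, and the way task termination enters the real rule is not stated on this page, so that field is carried but unused.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical cut-offs for illustration only; the paper's thresholds are not given here.
NEAR_CEILING_RATIO = 0.8      # expert return as a fraction of the best no-expert RL return
FAR_FROM_OPTIMAL_RATIO = 0.5


@dataclass(frozen=True)
class Observables:
    expert_quality: float  # expert return / best no-expert RL return
    terminates: bool       # task has early termination (e.g. a crash); unused in this sketch
    perturbation: str      # "none", "undertuning", "action_bias", or "obs_noise"


def flag_failure_modes(obs: Observables) -> Dict[str, bool]:
    """Return which of the failure modes F1-F3 the observables point to."""
    return {
        # F1: argmax-plus-bootstrap critic blind spot on RL-near-ceiling experts
        "F1_critic_blind_spot": obs.expert_quality >= NEAR_CEILING_RATIO,
        # F2: residual saturation on far-from-optimal experts
        "F2_residual_saturation": obs.expert_quality <= FAR_FROM_OPTIMAL_RATIO,
        # F3: warm-start buffer poisoning of handoff methods under expert undertuning
        "F3_buffer_poisoning": obs.perturbation == "undertuning",
    }


if __name__ == "__main__":
    print(flag_failure_modes(Observables(0.8, False, "none")))
```

A selector built on this would then avoid the method family tied to each flagged mode (argmax-gated methods for F1, residual methods for F2, training-time handoff for F3).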

If this is right

  • Practitioners should measure how close the expert is to the RL performance ceiling before selecting a guidance method.
  • Training-time handoff methods risk collapse if the expert is undertuned at deployment time.
  • Design choices in gate form and scoring rules can be tuned separately to mitigate specific failure modes.
  • On tasks with near-ceiling experts, query-time methods may need longer budgets or different architectures to produce gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In real deployments, pre-deployment checks of expert quality against RL baselines become a necessary step for method selection.
  • The observed limit on near-ceiling experts may be budget-dependent rather than fundamental, suggesting tests at higher step counts.
  • The decision rule could be implemented as an automated selector in RL toolkits for users who have an existing expert controller.

Load-bearing premise

The three degradation axes of undertuning, action bias, and observation noise along with the chosen continuous-control environments represent the main variations practitioners will encounter when using expert-guided RL.
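To make the degradation axes concrete, the sketch below wraps a toy expert with an additive action bias and Gaussian observation noise, and models undertuning as a reduced controller gain. The proportional controller, the function names, and the specific noise model are assumptions for illustration; the paper's actual perturbation definitions are not reproduced on this page.

```python
import random
from typing import Callable, List, Optional, Sequence

Expert = Callable[[Sequence[float]], List[float]]


def make_p_controller(gain: float) -> Expert:
    """Hypothetical expert: a proportional controller driving each state to zero.
    A lower gain than the tuned one stands in for 'undertuning'."""
    def expert(obs: Sequence[float]) -> List[float]:
        return [-gain * o for o in obs]
    return expert


def degrade_expert(expert: Expert, action_bias: float = 0.0,
                   obs_noise_sigma: float = 0.0,
                   rng: Optional[random.Random] = None) -> Expert:
    """Wrap an expert so it sees noisy observations and emits biased actions."""
    rng = rng or random.Random(0)

    def degraded(obs: Sequence[float]) -> List[float]:
        if obs_noise_sigma > 0.0:
            obs = [o + rng.gauss(0.0, obs_noise_sigma) for o in obs]
        return [a + action_bias for a in expert(obs)]

    return degraded


if __name__ == "__main__":
    tuned = make_p_controller(gain=1.0)
    biased = degrade_expert(tuned, action_bias=0.1)
    noisy = degrade_expert(tuned, obs_noise_sigma=0.05)
    print(biased([1.0, -2.0]), noisy([1.0, -2.0]))
```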

What would settle it

A controlled experiment in which one query-time method consistently surpasses the expert on RL-near-ceiling tasks such as FourTank within the 1M-step budget would falsify the observed performance limit.

Figures

Figures reproduced from arXiv: 2605.09109 by Philippe Preux, Riad Akrour, Yann Berthelot.

Figure 1: Per-env asymptotic Expert-Normalized Advantage at …
Figure 2: Per-environment perturbation sweep (4×3 grid): rows are environments, columns are perturbation types (training-time expert undertuning; deployment-time expert action bias and observation noise). Each panel groups the five expert-using methods plus no-expert SAC; for every method, bars left-to-right are clean (darkest) and increasing σ. Bar height is ENA at the perturbed configuration (50 seeds); each row …
Figure 3: Uncensored per-env headline (companion to Figure …
Figure 4: Cross-env aggregated rliable view: per-method IQM (tick) and stratified-bootstrap IQR …
Figure 5: Per-environment learning curves. IQM across 100/50 seeds (control/locomotion) with …
Figure 6: Per-environment expert bias ∆ = J − J_exp. The horizontal axis at zero marks expert parity.
… pessimism + wide gate (κ = 4.05, τ = 8.51); FourTank (no termination, near-ceiling expert) hits the lower κ bound with moderate τ.

Task           Tuned κ  Tuned τ  Termination?  Expert quality
Plane3DCircle  4.05     8.51     yes (crash)   weak (50% of best RL)
GlassFurnace   2.61     0.24     no            moderate (80% of best RL)
CheetahRun     1.16     0.17     no            weak…
Figure 7: Sample efficiency: per-method IQM (tick) and 95% bootstrap CI (rectangle) of …
Figure 8: Per-environment perturbation sweep, ENA scale with per-bar percent annotations (annotated …
Figure 9: Per-environment perturbation sweep, raw IQM final return (un-normalized companion to …
Figure 10: Plane3DCircle ablation bars. Reference (green) is EDGE at the tuned configuration; each …
Figure 11: CheetahRun ablation bars (companion to Figure …
Figure 12: FourTank ablation bars (companion to Figure …
Figure 13: Per-environment renderings for the three non-standard benchmark tasks.
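
Several captions above report a per-method interquartile mean (IQM) with bootstrap intervals over seeds. The following is a minimal stdlib-only sketch of that statistic for a single environment. It uses a plain percentile bootstrap rather than the stratified cross-environment bootstrap the captions mention, and the function names are illustrative rather than taken from the paper or from rliable.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def iqm(scores: Sequence[float]) -> float:
    """Interquartile mean: drop the lowest and highest 25% of scores, average the rest."""
    s = sorted(scores)
    cut = int(0.25 * len(s))
    return mean(s[cut:len(s) - cut])


def bootstrap_ci(scores: Sequence[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the IQM, resampling seeds with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats: List[float] = sorted(
        iqm([scores[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    final_returns = [0.0, 0.0, 10.0, 10.0, 10.0, 10.0, 100.0, 100.0]  # made-up per-seed scores
    print(iqm(final_returns), bootstrap_ci(final_returns))
```

The IQM discards the outer quartiles, so a handful of collapsed or lucky seeds moves it much less than it would move the plain mean.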

Many continuous-control problems ship with a competent but suboptimal controller (a tuned PID, a hand-designed gait). A growing family of methods uses such controllers as queryable experts during RL, but each method has been proposed in isolation, on a different benchmark, without imperfect-expert testing. We harmonize the comparison on a shared SAC backbone, common HPO and evaluation protocols, 100/50 seeds per (env, method), and a degradation sweep over expert undertuning, action bias, and observation noise. The comparison surfaces three failure modes single-paper evaluations miss: (F1) a critic blind spot under argmax-plus-bootstrap that drags IBRL below no-expert SAC on experts close to the no-expert-RL ceiling (RL-near-ceiling, distinct from the absolute physical ceiling); (F2) residual saturation on far-from-optimal experts; and (F3) warm-start buffer poisoning that collapses training-time-handoff methods under deployment-time expert undertuning. No single method dominates: each wins on one task-structure regime and fails predictably elsewhere; on RL-near-ceiling experts (FourTank, GlassFurnace) no query-time method clears the expert within our 1M-step budget, leaving open whether this is a fundamental wall or a budget effect. We convert the spread into a testable decision rule keyed on three pre-training observables (expert quality, task termination, perturbation type). The benchmark, taxonomy, and decision rule are the primary contribution; we additionally describe EDGE, a softmax-over-ensemble-LCB design point used to demonstrate that both axes the taxonomy points to (gate form, scoring rule) are individually exploitable.
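
To illustrate the two design axes the abstract names (gate form and scoring rule), here is a minimal sketch of a softmax gate over ensemble lower-confidence-bound (LCB) scores, next to a hard argmax gate on ensemble means. It is a reading of the one-line description of EDGE, not the authors' implementation: treating κ as the pessimism coefficient and τ as the softmax temperature is an assumption based on the Figure 6 caption, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def lcb(q_values: Sequence[float], kappa: float) -> float:
    """Pessimistic score: ensemble mean minus kappa times ensemble std."""
    return mean(q_values) - kappa * pstdev(q_values)


def softmax_gate(q_expert: Sequence[float], q_policy: Sequence[float],
                 kappa: float, tau: float) -> float:
    """Probability of executing the expert's action under a softmax over LCB scores.
    Larger tau gives a wider (softer) gate."""
    s_e, s_p = lcb(q_expert, kappa), lcb(q_policy, kappa)
    m = max(s_e, s_p)  # subtract the max for numerical stability
    w_e = math.exp((s_e - m) / tau)
    w_p = math.exp((s_p - m) / tau)
    return w_e / (w_e + w_p)


def argmax_gate(q_expert: Sequence[float], q_policy: Sequence[float]) -> float:
    """Hard gate for contrast: always pick whichever action has the higher mean Q."""
    return 1.0 if mean(q_expert) >= mean(q_policy) else 0.0


if __name__ == "__main__":
    q_exp = [1.0, 1.1, 0.9]   # made-up ensemble Q-estimates for the expert's action
    q_pol = [1.2, 0.2, 2.0]   # higher mean, but much less certain
    print(argmax_gate(q_exp, q_pol), softmax_gate(q_exp, q_pol, kappa=1.0, tau=0.5))
```

In the made-up example, the hard gate picks the policy action on mean Q alone, while the pessimistic softmax gate still assigns most of the probability to the expert because the policy action's estimate is far more uncertain.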

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript harmonizes empirical comparisons of query-time expert-guided RL methods on a shared SAC backbone across continuous-control tasks. Using degradation sweeps over expert undertuning, action bias, and observation noise, with 100/50 seeds per (env, method) and common HPO/evaluation protocols, it identifies three failure modes (F1: critic blind spot under argmax-plus-bootstrap on RL-near-ceiling experts; F2: residual saturation on far-from-optimal experts; F3: warm-start buffer poisoning under deployment-time undertuning). No method dominates; each succeeds in specific regimes. The authors derive a testable decision rule from three pre-training observables and introduce EDGE (softmax-over-ensemble-LCB) to show that gate form and scoring rule are exploitable axes.

Significance. If the findings hold, the work is significant for moving expert-guided RL beyond isolated proposals to a comparative taxonomy and practical decision rule that surfaces regime-dependent failure modes missed by single-paper evaluations. The shared backbone, large seed counts, and controlled degradation sweeps provide stronger evidence than prior work. Credit is given for the reproducible protocol and for framing falsifiable predictions around the failure modes and decision rule.

major comments (1)
  1. [Results on FourTank and GlassFurnace (failure mode F1)] The central claim that no query-time method clears the expert on RL-near-ceiling experts (FourTank, GlassFurnace) and the diagnosis of failure mode F1 both rest on classifying these experts as close to the no-expert-RL performance limit within the 1M-step budget. The manuscript provides no evidence that the no-expert SAC baselines have plateaued (e.g., flat learning curves, statistical tests for no further improvement, or convergence diagnostics). If the baselines are still rising at 1M steps, the experts are not demonstrably RL-near-ceiling, which undercuts both F1 and the conclusion that query-time guidance hits a wall.
minor comments (2)
  1. [Abstract] The abstract summarizes findings qualitatively but reports no quantitative metrics, effect sizes, or error bars to support the claims about failure modes, regime-specific wins, or the decision rule.
  2. [Decision rule derivation] The decision rule is described as keyed on three pre-training observables, but the exact thresholds, how the observables are computed from data, and the rule's cross-validation procedure should be stated more explicitly for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The feedback highlights an important point about supporting evidence for our classification of experts as RL-near-ceiling. We address this below and will revise the manuscript to strengthen the presentation.

  1. Referee: [Results on FourTank and GlassFurnace (failure mode F1)] The central claim that no query-time method clears the expert on RL-near-ceiling experts (FourTank, GlassFurnace) and the diagnosis of failure mode F1 both rest on classifying these experts as close to the no-expert-RL performance limit within the 1M-step budget. The manuscript provides no evidence that the no-expert SAC baselines have plateaued (e.g., flat learning curves, statistical tests for no further improvement, or convergence diagnostics). If the baselines are still rising at 1M steps, the experts are not demonstrably RL-near-ceiling, which undercuts both F1 and the conclusion that query-time guidance hits a wall.

    Authors: We agree that explicit evidence of plateauing strengthens the claim. The manuscript defines RL-near-ceiling relative to the 1M-step budget (distinct from absolute optimality) and already notes that longer training might close the gap. In the revision we will add the full learning curves for no-expert SAC on FourTank and GlassFurnace, which show stabilization well before 1M steps, together with a simple statistical check (e.g., no significant improvement over the final 200k steps). This directly supports the F1 diagnosis while preserving the manuscript's cautious framing that the observed wall may be budget-dependent. revision: yes
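
One way the plateau check described in this response could look is sketched below: for each seed, compare the mean return over the final 200k steps with the mean over the preceding 200k, then bootstrap the mean per-seed gain. This is an assumption about what a "simple statistical check" might be, not the authors' procedure; `plateau_check` and its parameters are hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence


def plateau_check(curves: Sequence[Sequence[float]], steps: Sequence[int],
                  window: int = 200_000, n_boot: int = 2000, seed: int = 0) -> Dict:
    """curves[k][i] is the evaluation return of seed k at environment step steps[i]."""
    end = steps[-1]
    last = [i for i, s in enumerate(steps) if s > end - window]
    prev = [i for i, s in enumerate(steps) if end - 2 * window < s <= end - window]
    # Per-seed gain of the final window over the window before it.
    gains: List[float] = [
        mean(c[i] for i in last) - mean(c[i] for i in prev) for c in curves
    ]
    rng = random.Random(seed)
    n = len(gains)
    boots = sorted(
        mean(gains[rng.randrange(n)] for _ in range(n)) for _ in range(n_boot)
    )
    lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot) - 1]
    # Flag the baseline as still improving only if the whole 95% interval is above zero.
    return {"mean_gain": mean(gains), "ci": (lo, hi), "still_rising": lo > 0}


if __name__ == "__main__":
    steps = list(range(100_000, 1_000_001, 100_000))
    flat = [[5.0] * len(steps) for _ in range(5)]  # made-up flat curves
    print(plateau_check(flat, steps))
```

An interval that includes zero is only weak evidence of a plateau; it does not rule out slow improvement beyond the 1M-step budget.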

Circularity Check

0 steps flagged

No circularity: empirical comparisons and induced decision rule are self-contained

The paper performs harmonized empirical evaluations of query-time expert-guided RL methods on continuous-control tasks with controlled expert degradations. Central claims rest on observed performance differences, failure-mode identification (F1-F3), and a decision rule induced from pre-training observables (expert quality, task termination, perturbation type). No equations, parameter fits, or derivations are presented that reduce by construction to the paper's own inputs or self-citations. The RL-near-ceiling classification is an empirical labeling based on 1M-step budgets rather than a self-definitional or fitted prediction. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The decision rule is induced from the empirical spread; its thresholds are therefore fitted to the observed data on the chosen environments and degradation axes. No new physical or mathematical entities are postulated.

free parameters (1)
  • decision-rule thresholds
    The rule is keyed on three observables; the cut-off values that map observables to method choice are necessarily determined from the experimental outcomes.

pith-pipeline@v0.9.0 · 5612 in / 1185 out tokens · 33551 ms · 2026-05-12T02:32:09.917637+00:00 · methodology
