pith. machine review for the scientific record.

arxiv: 2605.01643 · v3 · submitted 2026-05-02 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


AI Alignment via Incentives and Correction

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords ai alignment · incentive design · deterrence models · bilevel optimization · llm coding pipelines · reward adaptation · auditor oversight · agentic ai

The pith

Adaptive reward design in a solver-auditor game reduces hallucinated errors in LLM coding pipelines compared to static rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI misalignment arises when a solver agent can gain from producing persuasive but incorrect outputs while an auditor must weigh the cost of checking them, creating a fixed-point equilibrium rather than a simple training signal. If this view holds, then rewards attached only to final answers miss the joint dynamics of error, detection, and oversight that actually shape behavior in agentic pipelines. The authors formalize the setup as a principal choosing rewards over all possible correction outcomes, turning the problem into a bilevel optimization that must be solved for the induced behavioral equilibrium. A bandit algorithm searches over reward profiles using noisy interaction data. Experiments on an LLM coding pipeline show that the resulting adaptive profiles sustain auditor pressure and cut incorrect attempts more effectively than hand-designed static rewards.

Core claim

Alignment in agentic AI pipelines arises as a fixed point of a two-agent game in which a principal sets rewards for joint solver-auditor correction events; stronger penalties deter solver errors but may weaken the auditor's incentive to inspect, so the optimal rewards are those that induce a desirable behavioral equilibrium, which can be located by bandit search over reward profiles.
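
Read as a formal sketch (in notation of our own, since the page does not reproduce the paper's symbols), the claim has roughly this shape: the solver picks an error rate, the auditor picks an inspection rate at some cost, the principal picks a reward profile over joint correction outcomes, and the desirable behavioral equilibrium is a fixed point of the two best responses, with reward design sitting above it as a bilevel program.

```latex
% Illustrative notation only, not the paper's: e = solver error rate,
% q = auditor inspection rate, c = inspection cost, r = reward profile
% over joint correction outcomes, V = the principal's alignment objective.
e^*(q; r) \in \arg\max_{e} \; \mathbb{E}\!\left[ u_S(e, q; r) \right],
\qquad
q^*(e; r) \in \arg\max_{q} \; \mathbb{E}\!\left[ u_A(e, q; r) \right] - c\,q
\\[6pt]
\text{equilibrium under } r:\quad
e^\dagger = e^*(q^\dagger; r), \qquad q^\dagger = q^*(e^\dagger; r)
\\[6pt]
\text{bilevel reward design:}\quad
\max_{r} \; V\!\left( e^\dagger(r),\, q^\dagger(r) \right)
\ \ \text{s.t.}\ \ \left( e^\dagger(r),\, q^\dagger(r) \right) \text{ solves the fixed point above.}
```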

What carries the argument

the bilevel optimization over rewards for joint correction outcomes in the solver-auditor model, with an outer bandit loop that searches for profiles inducing aligned equilibria from interaction feedback
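
As a concrete picture of that outer loop, here is a minimal sketch of Thompson sampling over a handful of candidate reward profiles, with a toy stand-in for the pipeline's noisy principal-value feedback; the profile fields, the simulator, and the Gaussian posterior are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch, not the paper's code: a Thompson-sampling outer loop that
# searches a discrete set of reward profiles for the one inducing the best
# principal-aligned behavior, using noisy interaction feedback.
import random

# Hypothetical reward profiles over joint correction outcomes
# (solver correct, error caught, error missed, auditor inspected).
REWARD_PROFILES = [
    {"correct": 1.0, "caught": -1.0, "missed": -0.5, "inspect_bonus": 0.2},
    {"correct": 1.0, "caught": -2.0, "missed": -0.5, "inspect_bonus": 0.5},
    {"correct": 1.0, "caught": -0.5, "missed": -2.0, "inspect_bonus": 0.2},
]

def run_pipeline(profile, noise=0.2):
    """Toy stand-in for one batch of solver-auditor interactions.

    In the paper the feedback would come from the actual LLM coding
    pipeline; here a made-up linear effect plus noise keeps the sketch
    runnable."""
    true_value = 0.5 * profile["inspect_bonus"] - 0.2 * profile["missed"]
    return true_value + random.gauss(0.0, noise)

def thompson_outer_loop(n_rounds=200, obs_var=0.05):
    # One Gaussian posterior (mean, variance) per candidate profile.
    post = [{"mean": 0.0, "var": 1.0} for _ in REWARD_PROFILES]
    for _ in range(n_rounds):
        # Sample a plausible value for each arm and play the best sample.
        samples = [random.gauss(p["mean"], p["var"] ** 0.5) for p in post]
        arm = max(range(len(samples)), key=samples.__getitem__)
        value = run_pipeline(REWARD_PROFILES[arm])
        # Conjugate Gaussian update with known observation noise obs_var.
        p = post[arm]
        precision = 1.0 / p["var"] + 1.0 / obs_var
        p["mean"] = (p["mean"] / p["var"] + value / obs_var) / precision
        p["var"] = 1.0 / precision
    return max(range(len(post)), key=lambda i: post[i]["mean"])

if __name__ == "__main__":
    print("best profile:", REWARD_PROFILES[thompson_outer_loop()])
```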

If this is right

  • Adaptive rewards can reduce the rate of hallucinated incorrect attempts in LLM coding tasks.
  • Auditor monitoring incentives remain active even as the solver's performance improves.
  • Post-training signals should incorporate the full correction event rather than the final answer alone.
  • The bilevel approach yields principal-aligned outcomes superior to those from static hand-designed rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same incentive modeling could apply to other agentic AI systems involving verification steps, such as multi-step reasoning or tool-using agents.
  • It suggests that reward design in reinforcement learning from human feedback might benefit from explicitly including auditor or verifier cost dynamics.
  • If the equilibria prove robust, the framework could inform governance approaches that treat AI oversight as an enforcement problem rather than pure capability shaping.

Load-bearing premise

Real LLM solver-auditor interactions in coding pipelines behave like the strategic economic deterrence model, allowing the bilevel reward design to produce stable aligned equilibria.
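
In the classical deterrence reading that this premise imports (symbols are ours, for illustration), the solver errs only when its expected gain beats the expected sanction, which is what makes the auditor's inspection rate part of the solver's calculus:

```latex
% Becker-style deterrence condition; illustrative notation, not the paper's:
% g = gain from a persuasive but incorrect output, q = probability the
% auditor inspects and catches it, f = penalty on a caught error.
\text{solver errs} \;\iff\; g \;>\; q \cdot f
```

The premise is that post-trained LLM solvers and auditors in coding pipelines actually respond to q and f in this way, rather than being insensitive to the reward and oversight details.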

What would settle it

Applying the adaptive reward profiles to an LLM coding pipeline and observing either no reduction in hallucinated incorrect attempts or a drop in auditor inspection rates, relative to static hand-designed rewards.

Figures

Figures reproduced from arXiv: 2605.01643 by Elad Hazan, Joshua Lin, Mark Braverman, Rohit Agarwal.

Figure 1. Many MAB algorithms see similar performance, but Thompson profiles converge early.
Figure 2. One-step local incentives over pure states.
Figure 3. Abstract model of rewards: the space of feasible model weights (…)
Figure 4. Validation PV over training for three solver–auditor co-training variants with the same (…)
Figure 5. Hallucination-rate trajectories on the same held-out 200 (…)
Figure 6. Per-outer-iteration solver pass rates for the Thompson controller (red) and the compute (…)
Figure 7. Downstream LiveCodeBench comparison for the base model, the final fixed-binary (…)
Figure 8. Training-side behavior trajectories for the Thompson controller, fixed-binary, and fixed (…)
Figure 9. Schedule-shape ablation training PV trajectories versus total training steps (solver plus (…)
Figure 10. Schedule-shape ablation at matched compute, shown as a two-panel final-checkpoint (…)
Figure 11. Agentic-revision benchmark summary. Left: first-pass versus post-revision overall pass (…)
Original abstract

We study AI alignment through the lens of law-and-economics models of deterrence and enforcement. In these models, misconduct is not treated as an external failure, but as a strategic response to incentives: an actor weighs the gain from violation against the probability of detection and the severity of punishment. We argue that the same logic arises naturally in agentic AI pipelines. A solver may benefit from producing a persuasive but incorrect answer, hiding uncertainty, or exploiting spurious shortcuts, while an auditor or verifier must decide whether costly monitoring is worthwhile. Alignment is therefore a fixed-point problem: stronger penalties may deter solver misbehavior, but they can also reduce the auditor's incentive to inspect, since auditing then mainly incurs cost on a population that appears increasingly aligned. This perspective also changes what should count as a post-training signal. Standard feedback often attaches reward to the final answer alone, but a solver-auditor pipeline exposes the full correction event: whether the solver erred, whether the auditor inspected, whether the error was caught, and whether oversight incentives remained active. We formalize this interaction in a two-agent model in which a principal chooses rewards over joint correction outcomes, inducing both solver behavior and auditor monitoring. Reward design is therefore a bilevel optimization problem: rewards are judged not by their immediate semantic meaning, but by the behavioral equilibrium they induce. We propose a bandit-based outer-loop procedure for searching over reward profiles using noisy interaction feedback. Experiments on an LLM coding pipeline show that adaptive reward profiles can maintain useful oversight pressure and improve principal-aligned outcomes relative to static hand-designed rewards, including a substantial reduction in hallucinated incorrect attempts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper models AI alignment in agentic pipelines as a strategic deterrence problem between a solver (which may produce incorrect outputs for gain) and an auditor (which decides on costly monitoring). It formalizes this as a two-agent game where a principal designs rewards over joint outcomes (error, inspection, correction) via bilevel optimization to induce aligned behavioral equilibria, proposes a bandit-based search over reward profiles using interaction feedback, and reports experimental gains on an LLM coding pipeline: adaptive rewards maintain oversight pressure and reduce hallucinated incorrect attempts relative to static hand-designed rewards.

Significance. If the central claim holds, the work supplies a concrete mechanism for post-training alignment that treats oversight as an endogenous equilibrium rather than an external constraint, potentially improving robustness in multi-agent LLM systems. The explicit use of law-and-economics deterrence logic and the bandit outer loop for reward search are distinctive contributions; the empirical demonstration on a coding pipeline provides a falsifiable testbed even if the equilibrium-matching analysis remains incomplete.

major comments (2)
  1. [Experiments] Experiments section: the reported improvements (reduction in hallucinated attempts, higher aligned outcomes) are measured only on final pipeline metrics; no data or analysis is supplied on whether observed inspection frequencies, error rates, or correction-event distributions match the Nash or fixed-point conditions derived from the two-agent bilevel model. Without this check, the performance lift cannot be attributed to the deterrence mechanism rather than any adaptive reward schedule.
  2. [Model formalization] Model formalization (Section 3): the bilevel reward-design problem is described conceptually but the manuscript supplies neither the explicit payoff matrices nor the equilibrium conditions (e.g., the fixed-point equation relating solver misbehavior probability to auditor inspection cost under the chosen reward profile). This absence prevents verification that the bandit search actually converges to the claimed stable aligned equilibria.
minor comments (2)
  1. [Notation] Notation for the joint outcome space (error, inspection, correction) is introduced in the abstract but never tabulated or given consistent symbols in the main text, making it difficult to map the described correction events to the reward vector.
  2. [Bandit procedure] The bandit procedure is said to use 'noisy interaction feedback,' yet no description is given of the reward-estimation variance, exploration schedule, or convergence criterion used in the outer loop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The two major comments identify important gaps in connecting the empirical results to the theoretical model and in making the formalization fully explicit. We address each below and commit to revisions that strengthen these links without altering the core claims.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the reported improvements (reduction in hallucinated attempts, higher aligned outcomes) are measured only on final pipeline metrics; no data or analysis is supplied on whether observed inspection frequencies, error rates, or correction-event distributions match the Nash or fixed-point conditions derived from the two-agent bilevel model. Without this check, the performance lift cannot be attributed to the deterrence mechanism rather than any adaptive reward schedule.

    Authors: We agree that the current experimental presentation focuses on end-to-end pipeline metrics and does not directly validate the equilibrium predictions. In the revision we will add a dedicated subsection that logs and reports the empirical frequencies of inspection events, solver error rates, and correction outcomes across reward profiles. These will be compared against the fixed-point conditions implied by the bilevel model (solver misbehavior probability as a function of auditor inspection cost and the chosen reward vector). This analysis will use the same interaction traces already collected, allowing readers to assess whether the observed behavior is consistent with the deterrence equilibrium rather than generic adaptivity. revision: yes

  2. Referee: [Model formalization] Model formalization (Section 3): the bilevel reward-design problem is described conceptually but the manuscript supplies neither the explicit payoff matrices nor the equilibrium conditions (e.g., the fixed-point equation relating solver misbehavior probability to auditor inspection cost under the chosen reward profile). This absence prevents verification that the bandit search actually converges to the claimed stable aligned equilibria.

    Authors: The model is intentionally kept at a level that applies across different agentic pipelines, which is why explicit numerical matrices were omitted. We accept that this reduces verifiability. In the revised Section 3 we will insert a new subsection that (i) defines the two-agent payoff matrix over the joint outcomes (error, inspection, correction, reward), (ii) states the fixed-point equation that equates the solver’s best-response misbehavior probability to the auditor’s best-response inspection probability under a given reward profile, and (iii) shows how the outer bandit loop is designed to search for reward vectors that induce equilibria satisfying the principal’s alignment objective. This will make the convergence claim directly checkable. revision: yes
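
For readers trying to picture what point (i) would deliver, a bimatrix of roughly this shape would suffice; the symbols below are illustrative placeholders, not values reconstructed from the paper:

```latex
% Entries are (solver payoff, auditor payoff). Illustrative notation only:
% r_ok, r_caught, r_miss, r_insp = rewards over correction outcomes,
% c = auditing cost, g = solver gain from an undetected error,
% l = loss charged to the auditor for a missed error.
\begin{array}{l|cc}
 & \text{auditor inspects} & \text{auditor abstains} \\ \hline
\text{solver correct} & (r_{\mathrm{ok}},\; r_{\mathrm{insp}} - c) & (r_{\mathrm{ok}},\; 0) \\
\text{solver errs} & (r_{\mathrm{caught}},\; r_{\mathrm{insp}} - c) & (g + r_{\mathrm{miss}},\; -l)
\end{array}
```

The fixed-point equation promised in point (ii) would then equate each agent's mixed-strategy best response to the other's under a given reward profile.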

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

full rationale

The paper draws its core framing from external law-and-economics deterrence models, introduces an independent two-agent formalization of solver-auditor interactions as a bilevel reward-design problem, and proposes a bandit outer loop as a search procedure. Experimental results on the LLM coding pipeline are reported as empirical outcomes of applying this procedure, not as quantities that reduce by construction to fitted parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes are shown to be self-referential or load-bearing only via author-overlapping citations. The central claim therefore retains independent content beyond its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that AI pipeline behavior can be usefully modeled as strategic economic actors responding to detection and punishment probabilities.

axioms (1)
  • domain assumption AI agents in solver-auditor pipelines act strategically to maximize their individual rewards, analogous to economic actors in deterrence models.
    This modeling choice enables the fixed-point and bilevel formulation described in the abstract.

pith-pipeline@v0.9.0 · 5591 in / 1315 out tokens · 46476 ms · 2026-05-12T04:48:07.623348+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
