pith. machine review for the scientific record.

arxiv: 2605.01643 · v3 · submitted 2026-05-02 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


AI Alignment via Incentives and Correction

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords ai alignment · incentive design · deterrence models · bilevel optimization · llm coding pipelines · reward adaptation · auditor oversight · agentic ai

The pith

Adaptive reward design in a solver-auditor game reduces hallucinated errors in LLM coding pipelines compared to static rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI misalignment arises when a solver agent can gain from producing persuasive but incorrect outputs while an auditor must weigh the cost of checking them, creating a fixed-point equilibrium rather than a simple training signal. If this view holds, then rewards attached only to final answers miss the joint dynamics of error, detection, and oversight that actually shape behavior in agentic pipelines. The authors formalize the setup as a principal choosing rewards over all possible correction outcomes, turning the problem into a bilevel optimization that must be solved for the induced behavioral equilibrium. A bandit algorithm searches over reward profiles using noisy interaction data. Experiments on an LLM coding pipeline show that the resulting adaptive profiles sustain auditor pressure and cut incorrect attempts more effectively than hand-designed static rewards.

Core claim

Alignment in agentic AI pipelines arises as a fixed point of a two-agent game in which a principal sets rewards for joint solver-auditor correction events; stronger penalties deter solver errors but may weaken the auditor's incentive to inspect, so the optimal rewards are those that induce a desirable behavioral equilibrium, which can be located by bandit search over reward profiles.
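
Read as a formal sketch (in notation of our own, since the page does not reproduce the paper's symbols), the claim has roughly this shape: the solver picks an error rate, the auditor picks an inspection rate at some cost, the principal picks a reward profile over joint correction outcomes, and the desirable behavioral equilibrium is a fixed point of the two best responses, with reward design sitting above it as a bilevel program.

```latex
% Illustrative notation only, not the paper's: e = solver error rate,
% q = auditor inspection rate, c = inspection cost, r = reward profile
% over joint correction outcomes, V = the principal's alignment objective.
e^*(q; r) \in \arg\max_{e} \; \mathbb{E}\!\left[ u_S(e, q; r) \right],
\qquad
q^*(e; r) \in \arg\max_{q} \; \mathbb{E}\!\left[ u_A(e, q; r) \right] - c\,q
\\[6pt]
\text{equilibrium under } r:\quad
e^\dagger = e^*(q^\dagger; r), \qquad q^\dagger = q^*(e^\dagger; r)
\\[6pt]
\text{bilevel reward design:}\quad
\max_{r} \; V\!\left( e^\dagger(r),\, q^\dagger(r) \right)
\ \ \text{s.t.}\ \ \left( e^\dagger(r),\, q^\dagger(r) \right) \text{ solves the fixed point above.}
```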

What carries the argument

the bilevel optimization over rewards for joint correction outcomes in the solver-auditor model, with an outer bandit loop that searches for profiles inducing aligned equilibria from interaction feedback
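
As a concrete picture of that outer loop, here is a minimal sketch of Thompson sampling over a handful of candidate reward profiles, with a toy stand-in for the pipeline's noisy principal-value feedback; the profile fields, the simulator, and the Gaussian posterior are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch, not the paper's code: a Thompson-sampling outer loop that
# searches a discrete set of reward profiles for the one inducing the best
# principal-aligned behavior, using noisy interaction feedback.
import random

# Hypothetical reward profiles over joint correction outcomes
# (solver correct, error caught, error missed, auditor inspected).
REWARD_PROFILES = [
    {"correct": 1.0, "caught": -1.0, "missed": -0.5, "inspect_bonus": 0.2},
    {"correct": 1.0, "caught": -2.0, "missed": -0.5, "inspect_bonus": 0.5},
    {"correct": 1.0, "caught": -0.5, "missed": -2.0, "inspect_bonus": 0.2},
]

def run_pipeline(profile, noise=0.2):
    """Toy stand-in for one batch of solver-auditor interactions.

    In the paper the feedback would come from the actual LLM coding
    pipeline; here a made-up linear effect plus noise keeps the sketch
    runnable."""
    true_value = 0.5 * profile["inspect_bonus"] - 0.2 * profile["missed"]
    return true_value + random.gauss(0.0, noise)

def thompson_outer_loop(n_rounds=200, obs_var=0.05):
    # One Gaussian posterior (mean, variance) per candidate profile.
    post = [{"mean": 0.0, "var": 1.0} for _ in REWARD_PROFILES]
    for _ in range(n_rounds):
        # Sample a plausible value for each arm and play the best sample.
        samples = [random.gauss(p["mean"], p["var"] ** 0.5) for p in post]
        arm = max(range(len(samples)), key=samples.__getitem__)
        value = run_pipeline(REWARD_PROFILES[arm])
        # Conjugate Gaussian update with known observation noise obs_var.
        p = post[arm]
        precision = 1.0 / p["var"] + 1.0 / obs_var
        p["mean"] = (p["mean"] / p["var"] + value / obs_var) / precision
        p["var"] = 1.0 / precision
    return max(range(len(post)), key=lambda i: post[i]["mean"])

if __name__ == "__main__":
    print("best profile:", REWARD_PROFILES[thompson_outer_loop()])
```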

If this is right

  • Adaptive rewards can reduce the rate of hallucinated incorrect attempts in LLM coding tasks.
  • Auditor monitoring incentives remain active even as the solver's performance improves.
  • Post-training signals should incorporate the full correction event rather than the final answer alone.
  • The bilevel approach yields principal-aligned outcomes superior to those from static hand-designed rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same incentive modeling could apply to other agentic AI systems involving verification steps, such as multi-step reasoning or tool-using agents.
  • It suggests that reward design in reinforcement learning from human feedback might benefit from explicitly including auditor or verifier cost dynamics.
  • If the equilibria prove robust, the framework could inform governance approaches that treat AI oversight as an enforcement problem rather than pure capability shaping.

Load-bearing premise

Real LLM solver-auditor interactions in coding pipelines behave like the strategic economic deterrence model, allowing the bilevel reward design to produce stable aligned equilibria.
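
In the classical deterrence reading that this premise imports (symbols are ours, for illustration), the solver errs only when its expected gain beats the expected sanction, which is what makes the auditor's inspection rate part of the solver's calculus:

```latex
% Becker-style deterrence condition; illustrative notation, not the paper's:
% g = gain from a persuasive but incorrect output, q = probability the
% auditor inspects and catches it, f = penalty on a caught error.
\text{solver errs} \;\iff\; g \;>\; q \cdot f
```

The premise is that post-trained LLM solvers and auditors in coding pipelines actually respond to q and f in this way, rather than being insensitive to the reward and oversight details.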

What would settle it

Applying the adaptive reward profiles to an LLM coding pipeline and observing either no reduction in hallucinated incorrect attempts or a drop in auditor inspection rates, relative to static hand-designed rewards.

Figures

Figures reproduced from arXiv: 2605.01643 by Elad Hazan, Joshua Lin, Mark Braverman, Rohit Agarwal.

Figure 1. Many MAB algorithms see similar performance, but Thompson profiles converge early.
Figure 2. One-step local incentives over pure states.
Figure 3. Abstract model of rewards: the space of feasible model weights (…)
Figure 4. Validation PV over training for three solver–auditor co-training variants with the same (…)
Figure 5. Hallucination-rate trajectories on the same held-out 200 (…)
Figure 6. Per-outer-iteration solver pass rates for the Thompson controller (red) and the compute (…)
Figure 7. Downstream LiveCodeBench comparison for the base model, the final fixed-binary (…)
Figure 8. Training-side behavior trajectories for the Thompson controller, fixed-binary, and fixed (…)
Figure 9. Schedule-shape ablation training PV trajectories versus total training steps (solver plus (…)
Figure 10. Schedule-shape ablation at matched compute, shown as a two-panel final-checkpoint (…)
Figure 11. Agentic-revision benchmark summary. Left: first-pass versus post-revision overall pass (…)
Original abstract

We study AI alignment through the lens of law-and-economics models of deterrence and enforcement. In these models, misconduct is not treated as an external failure, but as a strategic response to incentives: an actor weighs the gain from violation against the probability of detection and the severity of punishment. We argue that the same logic arises naturally in agentic AI pipelines. A solver may benefit from producing a persuasive but incorrect answer, hiding uncertainty, or exploiting spurious shortcuts, while an auditor or verifier must decide whether costly monitoring is worthwhile. Alignment is therefore a fixed-point problem: stronger penalties may deter solver misbehavior, but they can also reduce the auditor's incentive to inspect, since auditing then mainly incurs cost on a population that appears increasingly aligned. This perspective also changes what should count as a post-training signal. Standard feedback often attaches reward to the final answer alone, but a solver-auditor pipeline exposes the full correction event: whether the solver erred, whether the auditor inspected, whether the error was caught, and whether oversight incentives remained active. We formalize this interaction in a two-agent model in which a principal chooses rewards over joint correction outcomes, inducing both solver behavior and auditor monitoring. Reward design is therefore a bilevel optimization problem: rewards are judged not by their immediate semantic meaning, but by the behavioral equilibrium they induce. We propose a bandit-based outer-loop procedure for searching over reward profiles using noisy interaction feedback. Experiments on an LLM coding pipeline show that adaptive reward profiles can maintain useful oversight pressure and improve principal-aligned outcomes relative to static hand-designed rewards, including a substantial reduction in hallucinated incorrect attempts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper models AI alignment in agentic pipelines as a strategic deterrence problem between a solver (which may produce incorrect outputs for gain) and an auditor (which decides on costly monitoring). It formalizes this as a two-agent game where a principal designs rewards over joint outcomes (error, inspection, correction) via bilevel optimization to induce aligned behavioral equilibria, proposes a bandit-based search over reward profiles using interaction feedback, and reports experimental gains on an LLM coding pipeline: adaptive rewards maintain oversight pressure and reduce hallucinated incorrect attempts relative to static hand-designed rewards.

Significance. If the central claim holds, the work supplies a concrete mechanism for post-training alignment that treats oversight as an endogenous equilibrium rather than an external constraint, potentially improving robustness in multi-agent LLM systems. The explicit use of law-and-economics deterrence logic and the bandit outer loop for reward search are distinctive contributions; the empirical demonstration on a coding pipeline provides a falsifiable testbed even if the equilibrium-matching analysis remains incomplete.

major comments (2)
  1. [Experiments] Experiments section: the reported improvements (reduction in hallucinated attempts, higher aligned outcomes) are measured only on final pipeline metrics; no data or analysis is supplied on whether observed inspection frequencies, error rates, or correction-event distributions match the Nash or fixed-point conditions derived from the two-agent bilevel model. Without this check, the performance lift cannot be attributed to the deterrence mechanism rather than any adaptive reward schedule.
  2. [Model formalization] Model formalization (Section 3): the bilevel reward-design problem is described conceptually but the manuscript supplies neither the explicit payoff matrices nor the equilibrium conditions (e.g., the fixed-point equation relating solver misbehavior probability to auditor inspection cost under the chosen reward profile). This absence prevents verification that the bandit search actually converges to the claimed stable aligned equilibria.
minor comments (2)
  1. [Notation] Notation for the joint outcome space (error, inspection, correction) is introduced in the abstract but never tabulated or given consistent symbols in the main text, making it difficult to map the described correction events to the reward vector.
  2. [Bandit procedure] The bandit procedure is said to use 'noisy interaction feedback,' yet no description is given of the reward-estimation variance, exploration schedule, or convergence criterion used in the outer loop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The two major comments identify important gaps in connecting the empirical results to the theoretical model and in making the formalization fully explicit. We address each below and commit to revisions that strengthen these links without altering the core claims.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the reported improvements (reduction in hallucinated attempts, higher aligned outcomes) are measured only on final pipeline metrics; no data or analysis is supplied on whether observed inspection frequencies, error rates, or correction-event distributions match the Nash or fixed-point conditions derived from the two-agent bilevel model. Without this check, the performance lift cannot be attributed to the deterrence mechanism rather than any adaptive reward schedule.

    Authors: We agree that the current experimental presentation focuses on end-to-end pipeline metrics and does not directly validate the equilibrium predictions. In the revision we will add a dedicated subsection that logs and reports the empirical frequencies of inspection events, solver error rates, and correction outcomes across reward profiles. These will be compared against the fixed-point conditions implied by the bilevel model (solver misbehavior probability as a function of auditor inspection cost and the chosen reward vector). This analysis will use the same interaction traces already collected, allowing readers to assess whether the observed behavior is consistent with the deterrence equilibrium rather than generic adaptivity. revision: yes

  2. Referee: [Model formalization] Model formalization (Section 3): the bilevel reward-design problem is described conceptually but the manuscript supplies neither the explicit payoff matrices nor the equilibrium conditions (e.g., the fixed-point equation relating solver misbehavior probability to auditor inspection cost under the chosen reward profile). This absence prevents verification that the bandit search actually converges to the claimed stable aligned equilibria.

    Authors: The model is intentionally kept at a level that applies across different agentic pipelines, which is why explicit numerical matrices were omitted. We accept that this reduces verifiability. In the revised Section 3 we will insert a new subsection that (i) defines the two-agent payoff matrix over the joint outcomes (error, inspection, correction, reward), (ii) states the fixed-point equation that equates the solver’s best-response misbehavior probability to the auditor’s best-response inspection probability under a given reward profile, and (iii) shows how the outer bandit loop is designed to search for reward vectors that induce equilibria satisfying the principal’s alignment objective. This will make the convergence claim directly checkable. revision: yes
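
For readers trying to picture what point (i) would deliver, a bimatrix of roughly this shape would suffice; the symbols below are illustrative placeholders, not values reconstructed from the paper:

```latex
% Entries are (solver payoff, auditor payoff). Illustrative notation only:
% r_ok, r_caught, r_miss, r_insp = rewards over correction outcomes,
% c = auditing cost, g = solver gain from an undetected error,
% l = loss charged to the auditor for a missed error.
\begin{array}{l|cc}
 & \text{auditor inspects} & \text{auditor abstains} \\ \hline
\text{solver correct} & (r_{\mathrm{ok}},\; r_{\mathrm{insp}} - c) & (r_{\mathrm{ok}},\; 0) \\
\text{solver errs} & (r_{\mathrm{caught}},\; r_{\mathrm{insp}} - c) & (g + r_{\mathrm{miss}},\; -l)
\end{array}
```

The fixed-point equation promised in point (ii) would then equate each agent's mixed-strategy best response to the other's under a given reward profile.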

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

full rationale

The paper draws its core framing from external law-and-economics deterrence models, introduces an independent two-agent formalization of solver-auditor interactions as a bilevel reward-design problem, and proposes a bandit outer loop as a search procedure. Experimental results on the LLM coding pipeline are reported as empirical outcomes of applying this procedure, not as quantities that reduce by construction to fitted parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes are shown to be self-referential or load-bearing only via author-overlapping citations. The central claim therefore retains independent content beyond its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that AI pipeline behavior can be usefully modeled as strategic economic actors responding to detection and punishment probabilities.

axioms (1)
  • domain assumption AI agents in solver-auditor pipelines act strategically to maximize their individual rewards, analogous to economic actors in deterrence models.
    This modeling choice enables the fixed-point and bilevel formulation described in the abstract.

pith-pipeline@v0.9.0 · 5591 in / 1315 out tokens · 46476 ms · 2026-05-12T04:48:07.623348+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
