Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

Can Jin; Dimitris N. Metaxas; Eddy Zhang; Jiakang Li; Rui Wu

arxiv: 2606.00424 · v1 · pith:X4LXBMEXnew · submitted 2026-05-29 · 💻 cs.AI

Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

Can Jin , Jiakang Li , Rui Wu , Eddy Zhang , Dimitris N. Metaxas This is my paper

Pith reviewed 2026-06-28 21:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords weak supervisionscalable oversightcritique distillationon-policy learninglarge language modelsreasoning benchmarksalignment

0 comments

The pith

A weak model used only as a critic can guide a stronger model to better use its own knowledge by supplying revision directions rather than answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates a form of weak supervision where a smaller model acts solely as a critic that suggests non-misleading revision directions instead of solving tasks or rendering final judgments. This setup allows the strong model to leverage its existing capabilities more effectively, and experiments demonstrate that such critiques improve the performance of frozen strong models at inference time when critique quality is high. The authors introduce progressive on-policy critique distillation to select high-quality critiques and transfer the resulting behavior into the strong model through adaptive self-teacher signals. Results on reasoning and alignment benchmarks show gains that accumulate across training epochs, pointing to a route for scalable oversight even when weak models cannot provide reliable labels.

Core claim

Weak critiques from a smaller model improve the outputs of a frozen stronger model at inference time, and the quality of those critiques determines how much improvement occurs. Progressive on-policy critique distillation filters the higher-quality critiques generated during training and distills the critic-guided revisions into the strong model using self-generated teacher signals, producing measurable gains on reasoning and alignment benchmarks that increase over successive epochs.

What carries the argument

progressive on-policy critique distillation (OPCD), which filters high-quality critiques and transfers critic-guided revision behavior into the strong model via adaptive self-teacher signals

If this is right

Weak critiques improve the performance of frozen strong models at inference time.
Higher critique quality produces larger gains in the strong model's outputs.
OPCD enables the strong model to show progressive improvement across training epochs.
The approach provides a concrete path toward scalable oversight that relies only on weak supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If revision directions remain useful even when the weak model cannot judge correctness, the method could support oversight of models whose outputs exceed direct human evaluation.
The same filtering and self-teacher mechanism might be applied iteratively within a single model family to create internal improvement loops at different capability levels.
Partial directional signals may prove more robust than full but noisy judgments when aligning models on tasks where complete evaluation is itself hard.

Load-bearing premise

The weak model can reliably supply revision directions that are non-misleading and help the strong model access its own knowledge rather than requiring the weak model to solve the underlying task.

What would settle it

Running the weak critic's revision suggestions on held-out reasoning and alignment tasks and finding no consistent performance lift or an actual drop in the strong model's accuracy would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.00424 by Can Jin, Dimitris N. Metaxas, Eddy Zhang, Jiakang Li, Rui Wu.

**Figure 1.** Figure 1: Pass@k performance reported as accuracy percentage under different inference budgets on GPQA Diamond and IFEval. The strong model is Phi-4-14B and the weak model is Phi-4-miniinstruct. The shaded region highlights the performance gap between the Sonly baseline and the S + W critic + ref ine setting. lems in total. This setting tests whether the usefulness of weak-model critiques also scales to thinking m… view at source ↗

**Figure 2.** Figure 2: Overview of the OPCD pipeline. We optimize the student by minimizing the token-level KL divergence: LOPCD(θ) = 1 |Se| X (x,y,f)∈Se X |y| t=1 KL πθ(· | x, y<t) [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Training dynamics of OPCD in alignment and reasoning scenarios. The left column shows results in the alignment scenario, while the right column shows results in the reasoning scenario. For each scenario, we report the training loss, training score, and validation performance over global training steps. The highlighted regions in Validation indicate the improvement over initial validation on the test set be… view at source ↗

read the original abstract

As large language models become stronger, weak supervisors may fail to provide reliable labels, preferences, or final judgments for complex outputs, limiting both weak-to-strong generalization and scalable oversight. We study a more tractable form of weak supervision: using a weak model as a critic rather than as a labeler or judge. Instead of solving the task or selecting the correct answer, the weak critic only needs to provide a non-misleading revision direction that helps the strong model better use its own knowledge. We call this setting *weak-critic strong oversight*. We first show that weak critiques can improve frozen strong models at inference time, and that critique quality is key to this improvement. We then propose progressive on-policy critique distillation (**OPCD**), which filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals. Experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs, suggesting an effective path for scalable oversight with weak supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The inference-time result with weak critiques on frozen models is the cleanest part; the OPCD training loop probably leans on ground-truth filtering and that undercuts the weak-supervision claim.

read the letter

The paper frames a useful distinction: a weak model only has to point out a revision direction rather than solve the task or pick the right answer. They show this can lift a frozen strong model at inference time and that better critiques produce bigger gains. That piece is straightforward and worth noting if the controls are decent.

OPCD then tries to turn those critiques into training signal by filtering high-quality ones and distilling via self-teacher signals. The abstract presents this as progressive improvement over epochs on reasoning and alignment benchmarks. The framing itself is new enough to separate from standard reward modeling or weak-to-strong generalization.

The soft spot is exactly the one the stress test flags. Filtering for high-quality critiques on reasoning benchmarks usually needs ground-truth labels or strong-model verification. If that is happening here, the training gains are closer to ordinary supervised distillation than to pure weak oversight. The inference-time result avoids this problem; the distillation claim does not. The abstract gives no details on the filter, baselines, or statistical tests, so it is hard to tell how much of the reported improvement is real versus artifact.

This is for people already working on scalable oversight who want to think about critique-based rather than label-based supervision. A reader could extract the inference-time experiment and the weak-critic framing, but the training results need the full paper to stand up.

I would send it to review if the authors can show the filter runs without oracle access; otherwise it is a desk-reject risk on the central claim.

Referee Report

2 major / 0 minor

Summary. The paper claims that weak models can serve as critics (providing non-misleading revision directions rather than labels or judgments) to improve stronger models in a 'weak-critic strong oversight' setting. It reports that weak critiques improve frozen strong models at inference time when critique quality is high, and introduces On-Policy Critique Distillation (OPCD) to filter high-quality critiques and distill critic-guided behavior into the strong model via adaptive self-teacher signals. Experiments on reasoning and alignment benchmarks are said to show improvements in strong models over training epochs.

Significance. If the OPCD filtering step can be implemented without ground-truth or strong-model oracle signals, the approach could offer a concrete mechanism for scalable oversight that leverages weak models without requiring them to solve tasks. The inference-time result on frozen models would be a useful finding if it generalizes. The paper does not report machine-checked proofs or parameter-free derivations.

major comments (2)

[Abstract] Abstract: The OPCD description states that the method 'filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals,' yet supplies no criteria for determining critique quality. If quality filtering relies on ground-truth labels (standard for reasoning benchmarks), the training loop has oracle access that the weak-critic premise is intended to avoid; this directly affects whether the reported epoch-wise gains can be attributed to weak supervision.
[Abstract] Abstract (empirical claims): The abstract reports that 'experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs' but provides no information on experimental controls, baseline comparisons, critique filtering implementation, or statistical significance. These omissions are load-bearing because the central claim of effective weak-critic distillation rests on the empirical results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on the abstract. We address the two major comments point-by-point below, clarifying the OPCD filtering mechanism and experimental details while proposing targeted revisions.

read point-by-point responses

Referee: [Abstract] Abstract: The OPCD description states that the method 'filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals,' yet supplies no criteria for determining critique quality. If quality filtering relies on ground-truth labels (standard for reasoning benchmarks), the training loop has oracle access that the weak-critic premise is intended to avoid; this directly affects whether the reported epoch-wise gains can be attributed to weak supervision.

Authors: The filtering criteria in OPCD rely exclusively on adaptive self-teacher signals: a critique is retained only if the strong model's subsequent on-policy response shows measurable improvement according to the model's own internal consistency or self-evaluation metrics, without any ground-truth labels or external oracles. This is formalized in Section 3.2 and Algorithm 1 of the manuscript. The premise of weak-critic oversight is preserved because the weak model supplies only revision directions and the selection signal is generated on-policy by the strong model itself. We will revise the abstract to explicitly state that filtering uses self-teacher signals rather than ground-truth. revision: yes
Referee: [Abstract] Abstract (empirical claims): The abstract reports that 'experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs' but provides no information on experimental controls, baseline comparisons, critique filtering implementation, or statistical significance. These omissions are load-bearing because the central claim of effective weak-critic distillation rests on the empirical results.

Authors: The abstract is intentionally concise per venue norms, but the full manuscript (Sections 4 and 5) specifies: (i) controls including frozen strong-model baselines and standard supervised fine-tuning without critiques; (ii) critique filtering implemented via the on-policy self-teacher procedure described above; (iii) multiple reasoning (e.g., GSM8K, MATH) and alignment benchmarks with epoch-wise curves; and (iv) statistical significance via paired t-tests and bootstrap confidence intervals. We will expand the abstract by one sentence to note the on-policy, oracle-free nature of the distillation and the inclusion of these controls. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks

full rationale

The paper proposes an empirical method (OPCD) and reports performance improvements on reasoning and alignment benchmarks. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present in the provided text. The central claims are externally falsifiable via benchmark results rather than reducing to self-referential definitions or inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method name and setting are introduced at a high level without further decomposition.

pith-pipeline@v0.9.1-grok · 5712 in / 1090 out tokens · 30308 ms · 2026-06-28T21:53:59.604841+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 13 internal anchors

[1]

Phi-4 Technical Report

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Concrete Problems in AI Safety

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schul- man, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

arXiv preprint arXiv:2408.11791 , year =

Ankner, Z., Paul, M., Cui, B., Chang, J. D., and Am- manabrolu, P. Critique-out-Loud reward models.arXiv preprint arXiv:2408.11791,

work page arXiv
[4]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Art of Problem Solving. 2024 aime i problems and so- lutions. https://artofproblemsolving.com/ wiki/index.php/2024_AIME_I, 2024a. Ac- cessed: 2026-05-24. Art of Problem Solving. 2024 aime ii problems and so- 8 Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight lutions. https://artofproblemsolving.com/ wiki/index.php/...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

arXiv preprint arXiv:2312.09390 , year =

Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y ., Ecoffet, A., Joglekar, M., Leike, J., et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision.arXiv preprint arXiv:2312.09390,

work page arXiv
[6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capa- bility in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

W2S-AlignTree: Weak-to-strong inference- time alignment for large language models via monte carlo tree search.arXiv preprint arXiv:2511.11518,

Ding, Z., Wang, Y ., Xiao, T., Wang, H., Jiang, C., and Ding, N. W2S-AlignTree: Weak-to-strong inference- time alignment for large language models via monte carlo tree search.arXiv preprint arXiv:2511.11518,

work page arXiv
[8]

MiniLLM: On-Policy Distillation of Large Language Models

Gu, Y ., Dong, L., Wei, F., and Huang, M. Minillm: Knowl- edge distillation of large language models.arXiv preprint arXiv:2306.08543,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

AI safety via debate

Irving, G., Christiano, P., and Amodei, D. AI safety via debate.arXiv preprint arXiv:1805.00899,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

arXiv preprint arXiv:2508.14313 , year =

Jin, C., Zhou, Y ., Zhang, Q., Peng, H., Zhang, D., Dong, Z., Pavone, M., Han, L., Hong, Z.-W., Che, T., et al. Your re- ward function for rl is your best prm for search: Unifying rl and search-based tts.arXiv preprint arXiv:2508.14313,

work page arXiv
[11]

Reason- ing over precedents alongside statutes: Case-augmented deliberative alignment for llm safety.arXiv preprint arXiv:2601.08000,

Jin, C., Wu, R., Che, T., Zhang, Q., Peng, H., Zhao, J., Wang, Z., Wei, W., Han, L., Zhang, Z., et al. Reason- ing over precedents alongside statutes: Case-augmented deliberative alignment for llm safety.arXiv preprint arXiv:2601.08000,

work page arXiv
[12]

R., Rocktäschel, T., and Perez, E

Khan, A., Hughes, J., Valentine, D., Ruis, L., Sachan, K., Radhakrishnan, A., Grefenstette, E., Bowman, S. R., Rocktäschel, T., and Perez, E. Debating with more persua- sive llms leads to more truthful answers.arXiv preprint arXiv:2402.06782,

work page arXiv
[13]

H., Chen, Y ., Edwards, H., Leike, J., McAleese, N., and Burda, Y

Kirchner, J. H., Chen, Y ., Edwards, H., Leike, J., McAleese, N., and Burda, Y . Prover-verifier games improve legibility of LLM outputs.arXiv preprint arXiv:2407.13692,

work page arXiv
[14]

Training Language Models to Self-Correct via Reinforcement Learning

Kumar, A., Zhuang, V ., Agarwal, R., Su, Y ., Co-Reyes, J. D., Singh, A., Baumli, K., Iqbal, S., Bishop, C., Roelofs, R., et al. Training language models to self-correct via reinforcement learning.arXiv preprint arXiv:2409.12917,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

M., Ceron Uribe, J

McAleese, N., Pokorny, R. M., Ceron Uribe, J. F., Nitishin- skaya, E., Trebacz, M., and Leike, J. LLM critics help catch LLM bugs.arXiv preprint arXiv:2407.00215,

work page arXiv
[16]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y ., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Self-critiquing models for assisting human evaluators

Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., and Leike, J. Self-critiquing models for assisting human evaluators.arXiv preprint arXiv:2206.05802,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

and Li, Y

Tao, L. and Li, Y . Your weak llm is secretly a strong teacher for alignment.arXiv preprint arXiv:2409.08813,

work page arXiv
[20]

arXiv preprint arXiv:2506.10139 , year=

Wen, J., Ankner, Z., Somani, A., Hase, P., Marks, S., Goldman-Wetzler, J., Petrini, L., Sleight, H., Burns, C., He, H., Feng, S., Perez, E., and Leike, J. Unsuper- vised elicitation of language models.arXiv preprint arXiv:2506.10139,

work page arXiv
[21]

arXiv preprint arXiv:2407.19594 , year =

Wu, T., Yuan, W., Golovneva, O., Xu, J., Tian, Y ., Jiao, J., Weston, J., and Sukhbaatar, S. Meta-rewarding language models: Self-improving alignment with LLM-as-a-Meta- Judge.arXiv preprint arXiv:2407.19594,

work page arXiv
[22]

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

URLhttps://arxiv.org/abs/2406.08464. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. Yang, W., Shen, S., Shen, G., Gong, W., Yao, Y ., and Lin, Y . Super(ficial)-alignment: Strong models may deceive weak models in weak-to-strong generalizat...

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Self-Rewarding Language Models

Yuan, W., Pang, R. Y ., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models.arXiv preprint arXiv:2401.10020,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Instruction-Following Evaluation for Large Language Models

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y ., Zhou, D., and Hou, L. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Phi-4 Technical Report

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Concrete Problems in AI Safety

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schul- man, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

arXiv preprint arXiv:2408.11791 , year =

Ankner, Z., Paul, M., Cui, B., Chang, J. D., and Am- manabrolu, P. Critique-out-Loud reward models.arXiv preprint arXiv:2408.11791,

work page arXiv

[4] [4]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Art of Problem Solving. 2024 aime i problems and so- lutions. https://artofproblemsolving.com/ wiki/index.php/2024_AIME_I, 2024a. Ac- cessed: 2026-05-24. Art of Problem Solving. 2024 aime ii problems and so- 8 Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight lutions. https://artofproblemsolving.com/ wiki/index.php/...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

arXiv preprint arXiv:2312.09390 , year =

Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y ., Ecoffet, A., Joglekar, M., Leike, J., et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision.arXiv preprint arXiv:2312.09390,

work page arXiv

[6] [6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capa- bility in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

W2S-AlignTree: Weak-to-strong inference- time alignment for large language models via monte carlo tree search.arXiv preprint arXiv:2511.11518,

Ding, Z., Wang, Y ., Xiao, T., Wang, H., Jiang, C., and Ding, N. W2S-AlignTree: Weak-to-strong inference- time alignment for large language models via monte carlo tree search.arXiv preprint arXiv:2511.11518,

work page arXiv

[8] [8]

MiniLLM: On-Policy Distillation of Large Language Models

Gu, Y ., Dong, L., Wei, F., and Huang, M. Minillm: Knowl- edge distillation of large language models.arXiv preprint arXiv:2306.08543,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

AI safety via debate

Irving, G., Christiano, P., and Amodei, D. AI safety via debate.arXiv preprint arXiv:1805.00899,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

arXiv preprint arXiv:2508.14313 , year =

Jin, C., Zhou, Y ., Zhang, Q., Peng, H., Zhang, D., Dong, Z., Pavone, M., Han, L., Hong, Z.-W., Che, T., et al. Your re- ward function for rl is your best prm for search: Unifying rl and search-based tts.arXiv preprint arXiv:2508.14313,

work page arXiv

[11] [11]

Reason- ing over precedents alongside statutes: Case-augmented deliberative alignment for llm safety.arXiv preprint arXiv:2601.08000,

Jin, C., Wu, R., Che, T., Zhang, Q., Peng, H., Zhao, J., Wang, Z., Wei, W., Han, L., Zhang, Z., et al. Reason- ing over precedents alongside statutes: Case-augmented deliberative alignment for llm safety.arXiv preprint arXiv:2601.08000,

work page arXiv

[12] [12]

R., Rocktäschel, T., and Perez, E

Khan, A., Hughes, J., Valentine, D., Ruis, L., Sachan, K., Radhakrishnan, A., Grefenstette, E., Bowman, S. R., Rocktäschel, T., and Perez, E. Debating with more persua- sive llms leads to more truthful answers.arXiv preprint arXiv:2402.06782,

work page arXiv

[13] [13]

H., Chen, Y ., Edwards, H., Leike, J., McAleese, N., and Burda, Y

Kirchner, J. H., Chen, Y ., Edwards, H., Leike, J., McAleese, N., and Burda, Y . Prover-verifier games improve legibility of LLM outputs.arXiv preprint arXiv:2407.13692,

work page arXiv

[14] [14]

Training Language Models to Self-Correct via Reinforcement Learning

Kumar, A., Zhuang, V ., Agarwal, R., Su, Y ., Co-Reyes, J. D., Singh, A., Baumli, K., Iqbal, S., Bishop, C., Roelofs, R., et al. Training language models to self-correct via reinforcement learning.arXiv preprint arXiv:2409.12917,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

M., Ceron Uribe, J

McAleese, N., Pokorny, R. M., Ceron Uribe, J. F., Nitishin- skaya, E., Trebacz, M., and Leike, J. LLM critics help catch LLM bugs.arXiv preprint arXiv:2407.00215,

work page arXiv

[16] [16]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y ., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Self-critiquing models for assisting human evaluators

Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., and Leike, J. Self-critiquing models for assisting human evaluators.arXiv preprint arXiv:2206.05802,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

and Li, Y

Tao, L. and Li, Y . Your weak llm is secretly a strong teacher for alignment.arXiv preprint arXiv:2409.08813,

work page arXiv

[20] [20]

arXiv preprint arXiv:2506.10139 , year=

Wen, J., Ankner, Z., Somani, A., Hase, P., Marks, S., Goldman-Wetzler, J., Petrini, L., Sleight, H., Burns, C., He, H., Feng, S., Perez, E., and Leike, J. Unsuper- vised elicitation of language models.arXiv preprint arXiv:2506.10139,

work page arXiv

[21] [21]

arXiv preprint arXiv:2407.19594 , year =

Wu, T., Yuan, W., Golovneva, O., Xu, J., Tian, Y ., Jiao, J., Weston, J., and Sukhbaatar, S. Meta-rewarding language models: Self-improving alignment with LLM-as-a-Meta- Judge.arXiv preprint arXiv:2407.19594,

work page arXiv

[22] [22]

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

URLhttps://arxiv.org/abs/2406.08464. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. Yang, W., Shen, S., Shen, G., Gong, W., Yao, Y ., and Lin, Y . Super(ficial)-alignment: Strong models may deceive weak models in weak-to-strong generalizat...

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Self-Rewarding Language Models

Yuan, W., Pang, R. Y ., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models.arXiv preprint arXiv:2401.10020,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Instruction-Following Evaluation for Large Language Models

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y ., Zhou, D., and Hou, L. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911,

work page internal anchor Pith review Pith/arXiv arXiv