pith. machine review for the scientific record.

arXiv: 2206.05802 · v2 · submitted 2022-06-12 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links


Self-critiquing models for assisting human evaluators

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:22 UTC · model grok-4.3

classification: 💻 cs.CL · cs.LG
keywords: self-critiquing models · natural language critiques · AI-assisted evaluation · summarization flaws · behavioral cloning · scaling laws · human feedback · large language models

The pith

Fine-tuned language models can generate critiques that help humans identify flaws in summaries they would otherwise overlook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that large language models fine-tuned via behavioral cloning on human-written critiques can assist evaluators in spotting errors within topic-based summaries. This assistance works for both naturally occurring mistakes in model- and human-generated summaries and for deliberate deceptions inserted by humans. The work examines how model scale affects critique quality and shows that larger models improve at both producing and using their own critiques to refine outputs. If this holds, it points toward using AI to extend human oversight to tasks where direct evaluation is currently too difficult or time-consuming.

Core claim

We fine-tune large language models to write natural language critiques using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques, and on most tasks are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones.

What carries the argument

Behavioral cloning from human critique examples to fine-tune large language models for producing natural language critiques that assist human evaluators on summarization tasks.
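
To make the training recipe concrete, here is a minimal behavioral-cloning sketch: a language model is fine-tuned with next-token cross-entropy on human critique demonstrations, with the loss masked so the model imitates only the critique portion. The toy character-level GRU and the two demonstration pairs are illustrative stand-ins, not the paper's large pretrained transformers or its released dataset.

```python
# Minimal behavioral-cloning sketch: imitate human-written critiques with
# next-token cross-entropy, masking the loss over the task prefix. A toy
# character-level GRU stands in for the paper's large pretrained models;
# the two demonstrations are hypothetical illustrative data.
import torch
import torch.nn as nn
import torch.nn.functional as F

demos = [
    ("Summary: The report says profits rose 5%.",
     "Critique: The passage states profits rose 3%, not 5% (accuracy)."),
    ("Summary: The study covers diet only.",
     "Critique: The summary omits the exercise findings (coverage)."),
]

# Character vocabulary; index 0 is reserved for padding / ignored targets.
chars = sorted({ch for task, crit in demos for ch in task + " " + crit})
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
PAD = 0

def encode(s):
    return [stoi[ch] for ch in s]

class TinyLM(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim, padding_idx=PAD)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        hidden, _ = self.rnn(self.emb(x))
        return self.head(hidden)

model = TinyLM(len(stoi) + 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    for task, critique in demos:
        ids = torch.tensor([encode(task + " " + critique)])
        logits = model(ids[:, :-1])  # predict each next character
        targets = ids[:, 1:].clone()
        # Behavioral cloning: only the critique tokens are imitated, so
        # mask out the loss on the task/summary prefix.
        prefix_len = len(encode(task + " "))
        targets[:, : prefix_len - 1] = PAD
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            ignore_index=PAD,
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
```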

If this is right

  • Larger models produce more helpful critiques than smaller ones across the studied tasks.
  • Larger models are better at self-critiquing even though their generated summaries become harder to critique.
  • Models can use their own critiques as feedback to produce improved summaries (see the sketch after this list).
  • Measurements indicate that even large models possess relevant knowledge they do not articulate in critique form.
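
A minimal sketch of that refinement loop, assuming a hypothetical generate() sampler for the fine-tuned model; the prompt strings are illustrative, not the paper's actual task formats.

```python
# Sketch of critique-conditioned self-refinement: the model critiques its
# own summary, then rewrites the summary conditioned on that critique.
# `generate` is a hypothetical stand-in for sampling from a fine-tuned LM.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in the fine-tuned model's sampler")

def refine(passage: str, topic: str, rounds: int = 2) -> str:
    summary = generate(f"Passage:\n{passage}\n\nSummarize for topic: {topic}")
    for _ in range(rounds):
        critique = generate(
            f"Passage:\n{passage}\n\nSummary:\n{summary}\n\n"
            "Write a critique pointing out the most severe flaw:"
        )
        summary = generate(
            f"Passage:\n{passage}\n\nSummary:\n{summary}\n\n"
            f"Critique:\n{critique}\n\nRewrite the summary to fix the flaw:"
        )
    return summary
```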

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning approach could be applied to other domains such as code review or scientific paper assessment where direct human judgment is costly.
  • If critique quality continues to improve with scale, recursive loops of model self-critique and refinement may become feasible for iterative improvement.
  • Human evaluators might gain lasting skill improvements from repeated exposure to model-generated critiques on the same task.
  • The gap between what models can generate and what they can critique suggests a need for new training objectives that align articulation with internal knowledge.

Load-bearing premise

Training on human critique examples produces critiques that generalize to new summaries without introducing systematic biases or missing flaw types that humans would notice unaided.

What would settle it

An experiment in which humans given model critiques detect no more summary flaws than evaluators assessing the same summaries unaided, or one showing that the critiques systematically overlook flaw categories that unaided humans reliably catch.
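
A hedged sketch of how such a comparison could be scored, assuming per-evaluator counts of confirmed flaws in each condition; the counts below are made up for illustration, and the percentile bootstrap is one reasonable choice among several.

```python
# Score the settling experiment: per-evaluator counts of confirmed flaws
# found with and without model critiques, plus a bootstrap confidence
# interval on the difference in means. The counts are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
flaws_with_critique = np.array([4, 5, 3, 6, 4, 5, 7, 4])  # illustrative
flaws_unassisted = np.array([3, 3, 2, 4, 3, 4, 5, 3])     # illustrative

def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean(a) - mean(b), independent samples."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(a, size=a.size).mean()
                    - rng.choice(b, size=b.size).mean())
    return tuple(np.quantile(diffs, [alpha / 2, 1 - alpha / 2]))

lo, hi = bootstrap_diff_ci(flaws_with_critique, flaws_unassisted)
print(f"95% CI for mean difference: [{lo:.2f}, {hi:.2f}]")
# An interval at or below zero would mean the critiques did not help.
```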

read the original abstract

We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. Our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. These results are a proof of concept for using AI-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly. We release our training datasets, as well as samples from our critique assistance experiments.
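
For concreteness, a toy rendering of the generation/discrimination/critique comparison the abstract mentions, under assumed operational definitions (G = fraction of flaw-free generations, D = accuracy at preferring a good answer over a flawed one, C = fraction of flawed answers that receive a valid critique); the paper's exact formulation may differ.

```python
# Toy G/D/C comparison under assumed operational definitions; the paper's
# exact metrics may differ. A positive C - D gap would mean the model
# critiques flaws more reliably than it discriminates good from bad.
from dataclasses import dataclass

@dataclass
class Record:
    generated_correct: bool  # the model's own generation was flaw-free
    discriminated: bool      # it preferred the good answer to the flawed one
    critiqued: bool          # it wrote a valid critique of the flawed answer

def gdc_scores(records):
    n = len(records)
    g = sum(r.generated_correct for r in records) / n
    d = sum(r.discriminated for r in records) / n
    c = sum(r.critiqued for r in records) / n
    return {"G": g, "D": d, "C": c, "CD gap": c - d, "GC gap": g - c}

# Hypothetical per-example outcomes, for illustration only.
records = [
    Record(True, True, True), Record(False, True, False),
    Record(True, True, True), Record(False, True, True),
    Record(True, False, False), Record(False, True, False),
]
print(gdc_scores(records))
```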

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper fine-tunes large language models via behavioral cloning to generate natural-language critiques of summaries. On topic-based summarization, these critiques improve human detection of both naturally occurring flaws (in model- and human-written summaries) and intentionally planted flaws. The work reports scaling trends (larger models produce more helpful critiques and better self-critiques), shows that models can use their own critiques for self-refinement, and introduces a framework comparing critiquing ability to generation and discrimination. Datasets and samples are released as a proof of concept for AI-assisted human feedback on hard-to-evaluate tasks.

Significance. If the results hold after addressing controls, the work supplies a concrete, scalable mechanism for augmenting human oversight of generative models on tasks where direct evaluation is difficult. The empirical scaling observations, self-refinement results, and public release of training data and critique samples constitute tangible strengths that could inform subsequent work on AI-assisted supervision.

major comments (1)
  1. [Human evaluation protocol] Human evaluation protocol (experiments on topic-based summarization and planted-flaw summaries): the central claim that model critiques cause humans to detect additional real flaws is not isolated from the confound of increased evaluation time or attention. No time-matched or placebo baseline (e.g., a neutral paragraph of equal length) is reported, so measured gains could arise from demand characteristics or extra scrutiny rather than critique content produced by behavioral cloning.
minor comments (2)
  1. [Framework section] The framework comparing critiquing to generation and discrimination is introduced but would benefit from explicit operational definitions and a table summarizing the three capabilities side-by-side.
  2. [Scaling and self-refinement experiments] Scaling plots and self-refinement results would be clearer with error bars or confidence intervals and explicit statement of the number of human evaluators per condition.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive summary of the work and for identifying a key methodological concern in the human evaluation protocol. We address this point directly below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Human evaluation protocol] Human evaluation protocol (experiments on topic-based summarization and planted-flaw summaries): the central claim that model critiques cause humans to detect additional real flaws is not isolated from the confound of increased evaluation time or attention. No time-matched or placebo baseline (e.g., a neutral paragraph of equal length) is reported, so measured gains could arise from demand characteristics or extra scrutiny rather than critique content produced by behavioral cloning.

    Authors: We agree that this is a valid concern. Our current experiments compare the critique condition against a no-critique baseline in which evaluators assess summaries without any additional text, but this does not control for the effects of extra reading time, attention, or demand characteristics induced by any supplementary paragraph. To isolate the contribution of the critique content itself, we will add a placebo control condition in the revised manuscript: evaluators will read a neutral paragraph of matched length (e.g., a generic topic description unrelated to the summary) before performing the flaw-detection task. We will report the results of this additional control alongside the existing conditions. revision: yes
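
A sketch of what that three-condition protocol could look like, assuming balanced random assignment and a word-count rule for length matching; the names and details are illustrative, not the authors' design.

```python
# Three-condition protocol sketch: no assistance, a length-matched neutral
# paragraph (placebo), or a model critique. Assignment is balanced by
# round-robin over a shuffled evaluator list. Illustrative only.
import random

CONDITIONS = ("no_assist", "placebo", "critique")

def assign(evaluator_ids, seed=0):
    rng = random.Random(seed)
    ids = list(evaluator_ids)
    rng.shuffle(ids)
    return {eid: CONDITIONS[i % 3] for i, eid in enumerate(ids)}

def length_matched_placebo(critique: str, neutral_text: str) -> str:
    # Trim a generic, summary-unrelated paragraph to the critique's word
    # count so reading time is comparable across conditions.
    n = len(critique.split())
    return " ".join(neutral_text.split()[:n])
```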

Circularity Check

0 steps flagged

No circularity: empirical evaluation from external human data

full rationale

The paper is an empirical ML study that fine-tunes models on human-written critiques via behavioral cloning and measures whether the resulting critiques help humans detect additional flaws in summaries. No mathematical derivations, equations, or first-principles results are present that could reduce to the inputs by construction. All reported outcomes (scaling trends, self-critique integration, comparison framework) are measured on held-out summaries and human raters, independent of any fitted-parameter renaming or self-citation load-bearing for uniqueness. The work therefore contains no steps matching the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that human critique examples are sufficient to train generalizable critique behavior and that the evaluation tasks capture real-world oversight needs. No new physical entities or mathematical axioms are introduced.

axioms (1)
  • domain assumption Behavioral cloning from human critique demonstrations produces useful critique behavior on held-out summaries
    Invoked when claiming that fine-tuned models help humans find flaws they would otherwise miss.

pith-pipeline@v0.9.0 · 5513 in / 1185 out tokens · 18123 ms · 2026-05-16T20:22:33.334217+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Self-Correction as Feedback Control: Error Dynamics, Stability Thresholds, and Prompt Interventions in LLMs

    cs.AI 2026-04 conditional novelty 7.0

    Self-correction in LLMs is stable and non-degrading only when ECR/EIR exceeds initial accuracy over (1-accuracy), with EIR below 0.5% cleanly separating helpful from harmful cases across models.

  2. Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

    cs.CL 2026-04 unverdicted novelty 7.0

    R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

  3. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  4. DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

  5. Building a Precise Video Language with Human-AI Oversight

    cs.CV 2026-04 unverdicted novelty 6.0

    CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...

  6. No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

    cs.AI 2026-01 unverdicted novelty 6.0

    ECHO jointly optimizes policy and critic via co-evolution, cascaded rollouts, and saturation-aware shaping to deliver non-stale feedback and higher success in open-world LLM agent RL.

  7. Towards Understanding Sycophancy in Language Models

    cs.CL 2023-10 conditional novelty 6.0

    Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.

  8. Cognitive Architectures for Language Agents

    cs.AI 2023-09 accept novelty 6.0

    CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...

  9. Simple synthetic data reduces sycophancy in large language models

    cs.CL 2023-08 unverdicted novelty 6.0

    Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

  10. Teaching Large Language Models to Self-Debug

    cs.CL 2023-04 unverdicted novelty 6.0

    Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

  11. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    cs.AI 2023-03 conditional novelty 6.0

    CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

  12. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  13. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  14. SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.

  15. FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness

    cs.CL 2026-04 unverdicted novelty 5.0

    FAITH improves LLM factual accuracy by mapping confidence and semantic entropy into natural-language knowledge-state quadrants for trustworthiness and honestness, then applying PPO with a combined reward and retrieval...

  16. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  17. Self-Refine: Iterative Refinement with Self-Feedback

    cs.CL 2023-03 unverdicted novelty 5.0

    Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 17 Pith papers
