Self-critiquing models for assisting human evaluators

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward · 2022 · cs.CL · arXiv 2206.05802

Canonical reference. 83% of citing Pith papers cite this work as background.

27 Pith papers citing it

abstract

We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. Our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. These results are a proof of concept for using AI-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly. We release our training datasets, as well as samples from our critique assistance experiments.

citation-role summary

background 6

citation-polarity summary

background 5 · support 1

representative citing papers

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.

Self-Correction as Feedback Control: Error Dynamics, Stability Thresholds, and Prompt Interventions in LLMs

cs.AI · 2026-04-24 · conditional · novelty 7.0

Self-correction in LLMs is stable and non-degrading only when ECR/EIR exceeds initial accuracy over (1-accuracy), with EIR below 0.5% cleanly separating helpful from harmful cases across models.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.

Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

cs.CL · 2026-04-06 · unverdicted · novelty 6.0

Emotional framings induce distinct behavioral shifts and form a structured geometry in the final-layer activations of small language models, with pressure linked to shortcuts and calm to honesty.

No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

cs.AI · 2026-01-11 · unverdicted · novelty 6.0

ECHO jointly optimizes policy and critic via co-evolution, cascaded rollouts, and saturation-aware shaping to deliver non-stale feedback and higher success in open-world LLM agent RL.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

LLM Evaluators Recognize and Favor Their Own Generations

cs.CL · 2024-04-15 · unverdicted · novelty 6.0

LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

Towards Understanding Sycophancy in Language Models

cs.CL · 2023-10-20 · conditional · novelty 6.0

Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.

UltraFeedback: Boosting Language Models with Scaled AI Feedback

cs.CL · 2023-10-02 · unverdicted · novelty 6.0

UltraFeedback is a large-scale AI feedback dataset that enables effective alignment of open-source language models, yielding strong results on chat benchmarks.

Cognitive Architectures for Language Agents

cs.AI · 2023-09-05 · accept · novelty 6.0

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.

Simple synthetic data reduces sycophancy in large language models

cs.CL · 2023-08-07 · unverdicted · novelty 6.0

Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

Teaching Large Language Models to Self-Debug

cs.CL · 2023-04-11 · unverdicted · novelty 6.0

Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

cs.AI · 2023-03-31 · conditional · novelty 6.0

CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

Language Models can Solve Computer Tasks

cs.CL · 2023-03-30 · accept · novelty 6.0

Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.

Measuring Progress on Scalable Oversight for Large Language Models

cs.HC · 2022-11-04 · unverdicted · novelty 6.0

Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

FBOS-RL is a feedback-driven bi-objective RL framework that combines Feedback-Guided Exploration Enhancement with Exploitation-oriented Policy Alignment and Exploration-oriented Capability Cultivation to raise training speed and final performance over GRPO under fixed rollout budgets.

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

FAITH improves LLM factual accuracy by mapping confidence and semantic entropy into natural-language knowledge-state quadrants for trustworthiness and honestness, then applying PPO with a combined reward and retrieval augmentation.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

cs.AI · 2025-07-15 · unverdicted · novelty 5.0

Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.

citing papers explorer

Showing 27 of 27 citing papers.

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator · cs.CL · 2026-05-20 · unverdicted · none · ref 48 · internal anchor
RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.
Self-Correction as Feedback Control: Error Dynamics, Stability Thresholds, and Prompt Interventions in LLMs · cs.AI · 2026-04-24 · conditional · none · ref 17 · internal anchor
Self-correction in LLMs is stable and non-degrading only when ECR/EIR exceeds initial accuracy over (1-accuracy), with EIR below 0.5% cleanly separating helpful from harmful cases across models.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF · cs.CL · 2026-04-20 · unverdicted · none · ref 62 · internal anchor
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models · cs.AI · 2024-06-14 · conditional · none · ref 281 · internal anchor
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
Self-Rewarding Language Models · cs.CL · 2024-01-18 · conditional · none · ref 39 · internal anchor
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification · cs.CL · 2026-05-10 · unverdicted · none · ref 29 · internal anchor
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
Building a Precise Video Language with Human-AI Oversight · cs.CV · 2026-04-22 · unverdicted · none · ref 53 · internal anchor
CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.
Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models · cs.CL · 2026-04-06 · unverdicted · none · ref 1 · internal anchor
Emotional framings induce distinct behavioral shifts and form a structured geometry in the final-layer activations of small language models, with pressure linked to shortcuts and calm to honesty.
No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning · cs.AI · 2026-01-11 · unverdicted · none · ref 10 · internal anchor
ECHO jointly optimizes policy and critic via co-evolution, cascaded rollouts, and saturation-aware shaping to deliver non-stale feedback and higher success in open-world LLM agent RL.
Training Language Models to Self-Correct via Reinforcement Learning · cs.LG · 2024-09-19 · unverdicted · none · ref 57 · internal anchor
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
LLM Evaluators Recognize and Favor Their Own Generations · cs.CL · 2024-04-15 · unverdicted · none · ref 23 · internal anchor
LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.
Towards Understanding Sycophancy in Language Models · cs.CL · 2023-10-20 · conditional · none · ref 18 · internal anchor
Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.
UltraFeedback: Boosting Language Models with Scaled AI Feedback · cs.CL · 2023-10-02 · unverdicted · none · ref 3 · internal anchor
UltraFeedback is a large-scale AI feedback dataset that enables effective alignment of open-source language models, yielding strong results on chat benchmarks.
Cognitive Architectures for Language Agents · cs.AI · 2023-09-05 · accept · none · ref 69 · internal anchor
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
Simple synthetic data reduces sycophancy in large language models · cs.CL · 2023-08-07 · unverdicted · none · ref 37 · internal anchor
Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
Teaching Large Language Models to Self-Debug · cs.CL · 2023-04-11 · unverdicted · none · ref 122 · internal anchor
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society · cs.AI · 2023-03-31 · conditional · none · ref 99 · internal anchor
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
Language Models can Solve Computer Tasks · cs.CL · 2023-03-30 · accept · none · ref 58 · internal anchor
Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
Measuring Progress on Scalable Oversight for Large Language Models · cs.HC · 2022-11-04 · unverdicted · none · ref 60 · internal anchor
Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.
Language Models (Mostly) Know What They Know · cs.CL · 2022-07-11 · unverdicted · none · ref 2 · internal anchor
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning · cs.LG · 2026-05-18 · unverdicted · none · ref 19 · internal anchor
FBOS-RL is a feedback-driven bi-objective RL framework that combines Feedback-Guided Exploration Enhancement with Exploitation-oriented Policy Alignment and Exploration-oriented Capability Cultivation to raise training speed and final performance over GRPO under fixed rollout budgets.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding · cs.LG · 2026-04-23 · unverdicted · none · ref 18 · internal anchor
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness · cs.CL · 2026-04-11 · unverdicted · none · ref 5 · internal anchor
FAITH improves LLM factual accuracy by mapping confidence and semantic entropy into natural-language knowledge-state quadrants for trustworthiness and honestness, then applying PPO with a combined reward and retrieval augmentation.
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · cs.AI · 2025-07-15 · unverdicted · none · ref 71 · internal anchor
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions · cs.CL · 2023-11-09 · unverdicted · none · ref 281 · internal anchor
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
Self-Refine: Iterative Refinement with Self-Feedback · cs.CL · 2023-03-30 · unverdicted · none · ref 37 · internal anchor
Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning · cs.AI · 2026-04-11 · unreviewed · ref 11 · internal anchor
