pith. machine review for the scientific record.

arxiv: 2303.17651 · v2 · submitted 2023-03-30 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 Lean theorem links

Self-Refine: Iterative Refinement with Self-Feedback

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords self-refinement · iterative improvement · self-feedback · large language models · LLM output refinement · test-time scaling · dialog generation · mathematical reasoning

The pith

Large language models can improve their own outputs by iteratively generating feedback and refinements without any training or extra models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Self-Refine, a method where an LLM first produces an output, then uses the same model to critique that output and generate a better version, repeating the process until improvements stop. This requires no supervised data, fine-tuning, or separate reward models, only the base LLM acting as generator, critic, and refiner in turn. The authors test it on seven tasks spanning dialog, reasoning, and more with GPT-3.5, ChatGPT, and GPT-4, finding that human judges and automatic scores favor the refined results, with roughly 20 percent absolute gains over plain one-step generation. A reader would care because the technique shows a straightforward way to extract more quality from existing models at inference time.

Core claim

Self-Refine demonstrates that the same LLM can generate an initial response, produce specific feedback on its shortcomings, and then produce an improved response based on that feedback, repeating the cycle as needed. When applied across dialog response generation, mathematical reasoning, and other tasks, this iterative self-correction yields outputs that both humans and metrics rate higher than the model's direct one-shot answers, with average task performance rising by about 20 percent absolute.

What carries the argument

Self-Refine, the three-step loop in which one LLM generates an output, writes feedback on it, and then rewrites the output to address the feedback, all without external supervision.
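The three-step loop reads naturally as a short program. Below is a minimal sketch, assuming a hypothetical `llm(prompt) -> str` call that plays generator, critic, and refiner in turn; the prompt wording and the stop-phrase check are illustrative assumptions, not the paper's actual templates or stopping rule.

```python
def self_refine(task_prompt, llm, max_iters=4, stop_phrase="no further issues"):
    """Generate, critique, and refine with a single model until feedback stops.

    `llm` is a hypothetical stand-in for one chat-model call; the same model
    acts as generator, critic, and refiner, as in Self-Refine.
    """
    # Step 1: initial generation.
    output = llm(f"Task: {task_prompt}\nProduce an answer.")
    for _ in range(max_iters):
        # Step 2: the same model critiques its own output.
        feedback = llm(f"Task: {task_prompt}\nAnswer: {output}\n"
                       "Give specific, actionable feedback on this answer.")
        if stop_phrase in feedback.lower():  # model judges the output good enough
            break
        # Step 3: refine the output to address the feedback, then repeat.
        output = llm(f"Task: {task_prompt}\nAnswer: {output}\n"
                     f"Feedback: {feedback}\n"
                     "Rewrite the answer to address the feedback.")
    return output
```

The iteration budget and stop condition are the knobs the paper leaves to the practitioner; the sketch caps rounds and also halts early when the feedback signals no remaining issues.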

If this is right

  • Task performance rises by roughly 20 percent on average over direct generation across dialog, reasoning, and similar problems.
  • Human evaluators consistently prefer the outputs after self-refinement to the initial one-step versions.
  • The gains hold for current top models such as GPT-4 without requiring any new training data or reinforcement learning.
  • The method applies uniformly to the seven tested tasks without task-specific engineering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Test-time iteration of this kind could serve as a lightweight substitute for additional pretraining or fine-tuning on some tasks.
  • The approach may reduce certain error types such as factual inconsistencies if the feedback step reliably catches them.
  • Combining the loop with existing prompting styles like chain-of-thought could produce further additive gains.

Load-bearing premise

The LLM must be able to produce accurate and actionable feedback on its own outputs that genuinely leads to better results rather than neutral changes or new mistakes.

What would settle it

A controlled test on any of the evaluated tasks in which multiple rounds of Self-Refine produce outputs that score no better than, or worse than, the model's standard single-pass generation on the same human or automatic metrics.

read the original abstract

Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback for its output and uses it to refine itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by ~20% absolute on average in task performance. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces Self-Refine, a training-free iterative method in which a single LLM first generates an initial output, then uses the same model to produce self-feedback on that output, and finally refines the output based on the feedback; the process can be repeated. The approach is evaluated on seven diverse tasks (dialogue, reasoning, code generation, etc.) with GPT-3.5, ChatGPT, and GPT-4, claiming that Self-Refine outputs are preferred by both human judges and automatic metrics over standard one-step generation, with an average absolute improvement of approximately 20%.

Significance. If the reported gains are shown to arise from genuine self-refinement rather than confounds, the result would be significant: it would demonstrate that current frontier LLMs can be improved at inference time through simple, standalone self-interaction without any additional training data, RL, or external models, providing a broadly applicable technique across NLP tasks.

major comments (4)
  1. [Evaluation / Results] The central empirical claim rests on the unverified assumption that the LLM produces accurate and actionable self-feedback. The manuscript provides no quantitative breakdown (e.g., human or automatic annotation of feedback correctness, error-identification rate, or adherence rate in the subsequent refinement step) in the evaluation or results sections; without this, it remains possible that the ~20% average lift arises from repeated sampling, longer context, or extra inference steps rather than iterative self-correction.
  2. [Experiments] No controls are reported for output length or total token usage. Iterative refinement typically produces longer responses; the paper does not compare against length-matched baselines or report token counts, leaving open the possibility that metric improvements (especially on tasks where verbosity correlates with quality) are partly driven by this confound rather than the refinement mechanism itself.
  3. [Results] The ~20% average improvement is presented without per-task variances, statistical significance tests, confidence intervals, or the exact number of iterations used per task and model. These details are necessary to establish that the gains are robust and not driven by a subset of tasks or unstable runs.
  4. [Method] Prompt templates for the initial generation, feedback, and refinement stages are not provided in sufficient detail (or in an appendix), which prevents exact reproduction and makes it impossible to determine whether the self-feedback prompts were carefully engineered or whether the method generalizes beyond the specific prompts used.
minor comments (2)
  1. [Abstract] The abstract states an average ~20% absolute improvement but does not specify which automatic metrics were used for each task; adding this information would improve clarity.
  2. [Related Work] Related work on self-consistency, chain-of-thought, and other test-time scaling methods is mentioned but could be expanded with more precise comparisons of computational cost and performance deltas.
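Several of the major comments (repeated sampling, longer context, extra inference steps) reduce to one control: match the total number of model calls across conditions. A minimal sketch of that budget-matched comparison, with hypothetical `generate`, `refine_once`, and `score` stand-ins rather than any API from the paper:

```python
def best_of_n(task, generate, score, n):
    """Baseline: draw n independent samples, keep the highest-scoring one."""
    samples = [generate(task) for _ in range(n)]
    return max(samples, key=score)

def self_refine_budgeted(task, generate, refine_once, budget):
    """Treatment: spend the same call budget on sequential refinement.

    One call produces the initial output; the remaining budget - 1 calls
    each run a critique-and-rewrite round (collapsed here into one
    hypothetical `refine_once` step for simplicity).
    """
    output = generate(task)
    for _ in range(budget - 1):
        output = refine_once(task, output)
    return output
```

If both arms consume the same `n` calls and Self-Refine still wins, the sampling confound the referee raises is ruled out; if best-of-n catches up, the gain was partly a compute effect.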

Simulated Author's Rebuttal

4 responses · 0 unresolved

Thank you for your thorough and constructive review. We appreciate the feedback and will address each major comment below, proposing specific revisions to strengthen the manuscript where appropriate.

read point-by-point responses
  1. Referee: [Evaluation / Results] The central empirical claim rests on the unverified assumption that the LLM produces accurate and actionable self-feedback. The manuscript provides no quantitative breakdown (e.g., human or automatic annotation of feedback correctness, error-identification rate, or adherence rate in the subsequent refinement step) in the evaluation or results sections; without this, it remains possible that the ~20% average lift arises from repeated sampling, longer context, or extra inference steps rather than iterative self-correction.

    Authors: We agree that a direct quantitative analysis of self-feedback quality would provide stronger support for the mechanism. Although human preference judgments and automatic metric gains indicate effective refinements, we will add a new analysis subsection reporting human-annotated feedback correctness, error identification rates, and adherence in the refinement step on sampled instances from multiple tasks. To address confounds such as repeated sampling or extra steps, we will also include comparisons against best-of-n sampling baselines with matched inference budgets. revision: yes

  2. Referee: [Experiments] No controls are reported for output length or total token usage. Iterative refinement typically produces longer responses; the paper does not compare against length-matched baselines or report token counts, leaving open the possibility that metric improvements (especially on tasks where verbosity correlates with quality) are partly driven by this confound rather than the refinement mechanism itself.

    Authors: We acknowledge the importance of controlling for length and token usage. In the revision, we will report average token counts and output lengths for baseline and Self-Refine outputs across all tasks and models. We will further add length-matched baseline comparisons, for example by constraining generation length in the one-step baseline or by length-normalized evaluation. revision: yes

  3. Referee: [Results] The ~20% average improvement is presented without per-task variances, statistical significance tests, confidence intervals, or the exact number of iterations used per task and model. These details are necessary to establish that the gains are robust and not driven by a subset of tasks or unstable runs.

    Authors: We will revise the results section to include per-task scores with standard deviations, 95% confidence intervals, and statistical significance tests (paired t-tests or equivalent) between Self-Refine and baselines. We will also explicitly state the iteration counts used per task and model (typically 2–3 iterations or until convergence). revision: yes

  4. Referee: [Method] Prompt templates for the initial generation, feedback, and refinement stages are not provided in sufficient detail (or in an appendix), which prevents exact reproduction and makes it impossible to determine whether the self-feedback prompts were carefully engineered or whether the method generalizes beyond the specific prompts used.

    Authors: We apologize for the omission. The revised manuscript will include all prompt templates in full detail in a dedicated appendix, covering the exact wording for initial generation, feedback, and refinement stages for each task and model. revision: yes
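The statistical reporting promised in response 3 needs no heavy tooling. One standard-library sketch is a paired bootstrap 95% confidence interval on per-instance score differences; the score lists in any real use would be task metrics, not values invented here, and the index arithmetic assumes `alpha * n_boot` divides cleanly.

```python
import random

def paired_bootstrap_ci(baseline, refined, n_boot=2000, alpha=0.05, seed=0):
    """95% CI for the mean paired score difference (refined minus baseline).

    Resamples the per-instance differences with replacement and reads the
    interval off the sorted bootstrap means (percentile method).
    """
    rng = random.Random(seed)
    diffs = [r - b for b, r in zip(baseline, refined)]
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

An interval that excludes zero on a given task is the per-task significance evidence the referee asks for; pairing by instance removes between-instance variance from the comparison.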

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation

full rationale

The paper introduces Self-Refine as an empirical prompting technique that uses the same LLM for generation, feedback, and refinement, then evaluates it on seven tasks against one-step baselines. No equations, derivations, fitted parameters, or first-principles predictions appear in the provided text. Claims rest on human and automatic metric comparisons showing ~20% average gains, not on any reduction of outputs to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes that would collapse the result. The core assumption about feedback quality is an unverified empirical hypothesis tested only via downstream task metrics, which is a validity concern rather than circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLMs possess sufficient meta-reasoning ability to critique and improve their own outputs; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: A single LLM can generate useful, actionable feedback on its own outputs that leads to measurable improvement when used for refinement.
    This assumption is required for the method to work without external supervision or training and is not derived or proven in the abstract.

pith-pipeline@v0.9.0 · 5585 in / 1229 out tokens · 62979 ms · 2026-05-10T20:43:04.617396+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  2. Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

    cs.AI 2026-05 unverdicted novelty 7.0

    Bot-Mod uses multi-turn dialogue guided by Gibbs sampling over intent hypotheses to identify malicious agent behavior in communities, showing reliable detection with low false positives on a Moltbook-derived dataset.

  3. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  4. Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 7.0

    PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.

  5. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  6. Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    cs.AI 2026-05 unverdicted novelty 7.0

    In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...

  7. From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

    cs.LG 2026-05 unverdicted novelty 7.0

    AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...

  8. GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.

  9. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  10. Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...

  11. HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

  12. Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...

  13. Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

    cs.AI 2026-04 unverdicted novelty 7.0

    Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.

  14. Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    CACM improves language-based drug discovery agents by 36.4% via protocol auditing, a grounded diagnostician, and compressed static/dynamic/corrective memory channels that localize failures and bias corrections.

  15. An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor

    cs.SE 2026-04 unverdicted novelty 7.0

    ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.

  16. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  17. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  18. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    cs.CL 2023-12 accept novelty 7.0

    A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.

  19. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  20. Reflexion: Language Agents with Verbal Reinforcement Learning

    cs.AI 2023-03 conditional novelty 7.0

    Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.

  21. RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

    cs.CL 2026-05 unverdicted novelty 6.0

    RTLC prompting lifts Claude 3.7 Sonnet pairwise accuracy on 350 hard JudgeBench items from 64.6% to 78.6% via a Research-Teach-Critique scaffold that beats self-consistency.

  22. Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

    cs.AI 2026-05 unverdicted novelty 6.0

    BOT-MOD uncovers hidden agent intent in multi-agent environments like Moltbook through guided multi-turn dialogue and Gibbs-based sampling over intent hypotheses.

  23. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  24. PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

    cs.AI 2026-05 unverdicted novelty 6.0

    PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.

  25. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  26. SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...

  27. Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

    cs.CL 2026-05 conditional novelty 6.0

    Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.

  28. LoopTrap: Termination Poisoning Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

  29. Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...

  30. When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

    cs.MA 2026-05 unverdicted novelty 6.0

    CAFE detects positive distributional Jensen Gaps across five multi-agent LLM architectures on a banking-risk benchmark, showing that quality drops under semantic stress can coexist with statistically detectable antifr...

  31. When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

    cs.MA 2026-05 unverdicted novelty 6.0

    CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.

  32. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

    cs.AI 2026-05 unverdicted novelty 6.0

    FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

  33. Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

    cs.SE 2026-05 unverdicted novelty 6.0

    RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...

  34. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  35. From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

    cs.AI 2026-04 unverdicted novelty 6.0

    Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

  36. Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

    cs.IR 2026-04 unverdicted novelty 6.0

    Rabtriever distills a generative reranker into an efficient independent encoder using JEPA and auxiliary reverse KL loss to achieve linear complexity and strong performance on rationale-based retrieval tasks.

  37. Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

    cs.IR 2026-04 unverdicted novelty 6.0

    Rabtriever distills a generative reranker into an efficient bi-encoder using on-policy JEPA to achieve near-reranker accuracy with linear complexity on rationale-based retrieval.

  38. QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

    cs.MA 2026-04 unverdicted novelty 6.0

    QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

  39. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  40. Preregistered Belief Revision Contracts

    cs.AI 2026-04 unverdicted novelty 6.0

    PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.

  41. ReflectCAP: Detailed Image Captioning with Reflective Memory

    cs.AI 2026-04 unverdicted novelty 6.0

    ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-cov...

  42. Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...

  43. Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code

    cs.SE 2026-04 unverdicted novelty 6.0

    Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.

  44. QoS-QoE Translation with Large Language Model

    cs.MM 2026-04 unverdicted novelty 6.0

    A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.

  45. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  46. Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

    cs.SE 2026-03 unverdicted novelty 6.0

    Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.

  47. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  48. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  49. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  50. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    cs.CL 2023-10 unverdicted novelty 6.0

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  51. Improving Factuality and Reasoning in Language Models through Multiagent Debate

    cs.CL 2023-05 unverdicted novelty 6.0

    Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

  52. Teaching Large Language Models to Self-Debug

    cs.CL 2023-04 unverdicted novelty 6.0

    Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

  53. ALGOGEN: Tool-Generated Verifiable Traces for Reliable Algorithm Visualization

    cs.AI 2026-05 unverdicted novelty 5.0

    ALGOGEN improves LLM-generated algorithm visualizations by splitting simulation into traceable JSON outputs via Visualization Trace Algebra and using Rendering Style Language for reliable rendering, raising success ra...

  54. AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

    cs.AI 2026-05 unverdicted novelty 5.0

    Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

  55. PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

    cs.CL 2026-05 unverdicted novelty 5.0

    An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.

  56. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  57. Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling

    cs.AI 2026-05 unverdicted novelty 5.0

    Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while sel...

  58. State Representation and Termination for Recursive Reasoning Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping ...

  59. Bolzano: Case Studies in LLM-Assisted Mathematical Research

    cs.CL 2026-04 unverdicted novelty 5.0

    A multi-agent LLM system autonomously produced publishable results on five out of eight mathematical and theoretical computer science problems.

  60. HYPERHEURIST: A Simulated Annealing-Based Control Framework for LLM-Driven Code Generation in Optimized Hardware Design

    cs.AR 2026-04 unverdicted novelty 5.0

    HYPERHEURIST uses simulated annealing to refine functionally validated LLM-generated RTL designs, producing more stable PPA optimization than single-pass LLM generation across eight benchmarks.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 64 Pith papers · 10 internal anchors

  1. [1]

    Teresa M. Amabile. 1983. https://doi.org/10.1007/978-1-4612-5533-8_4 A Theoretical Framework. In The Social Psychology of Creativity, pages 65--96. Springer New York, New York, NY

  2. [2]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. https://arxiv.org/abs/2204.05862 Training a helpful and harmless assistant with reinforcement learning from human feedback. ArXiv:2204.05862

  3. [3]

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022b. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073

  4. [4]

    Emery D Berger, Sam Stern, and Juan Altmayer Pizzorno. 2022. https://arxiv.org/abs/2212.07597 Triangulating Python Performance Issues with SCALENE. ArXiv preprint, abs/2212.07597

  5. [5]

    Lawrence D Brown, T Tony Cai, and Anirban DasGupta. 2001. Interval estimation for a binomial proportion. Statistical Science, 16(2):101--133

  6. [6]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  8. [8]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

  9. [9]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  10. [10]

    Sanjoy Dasgupta, Daniel Hsu, Stefanos Poulis, and Xiaojin Zhu. 2019. http://proceedings.mlr.press/v97/dasgupta19a.html Teaching a black-box learner. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1547--1555. PMLR

  11. [11]

    Wanyu Du, Zae Myung Kim, Vipul Raheja, Dhruv Kumar, and Dongyeop Kang. 2022. https://aclanthology.org/2022.in2writing-1.14 Read, revise, repeat: A system demonstration for human-in-the-loop iterative text revision . In Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022), pages 96--108, Dublin, Ireland. Ass...

  12. [12]

    Ahmed Elgohary, Christopher Meek, Matthew Richardson, Adam Fourney, Gonzalo Ramos, and Ahmed Hassan Awadallah. 2021. https://doi.org/10.18653/v1/2021.naacl-main.444 NL-EDIT: Correcting semantic parse errors through natural language interaction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Lin...

  13. [13]

    Linda Flower and John R Hayes. 1981. A cognitive process theory of writing. College composition and communication, 32(4):365--387

  14. [14]

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. https://arxiv.org/abs/2302.04166 GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166

  15. [15]

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2022. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435

  16. [16]

    Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. https://bair.berkeley.edu/blog/2023/04/03/koala/ Koala: A dialogue model for academic research . Blog post

  17. [18]

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. 2022b. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. ArXiv, abs/2207.01780

  18. [19]

    Juncen Li, Robin Jia, He He, and Percy Liang. 2018. https://doi.org/10.18653/v1/N18-1169 Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1865--187...

  19. [20]

    Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.165 CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823--1840, Online. As...

  20. [21]

    Jiacheng Liu, Skyler Hallinan, Ximing Lu, Pengfei He, Sean Welleck, Hannaneh Hajishirzi, and Yejin Choi. 2022. Rainier: Reinforced knowledge introspector for commonsense question answering. In Conference on Empirical Methods in Natural Language Processing

  21. [22]

    Ximing Lu, Sean Welleck, Liwei Jiang, Jack Hessel, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. 2022. Quark: Controllable text generation with reinforced unlearning. ArXiv, abs/2205.13636

  22. [23]

    Aman Madaan, Alexander Shypula, Uri Alon, Milad Hashemi, Parthasarathy Ranganathan, Yiming Yang, Graham Neubig, and Amir Yazdanbakhsh. 2023. Learning performance-improving code edits. arXiv preprint arXiv:2302.07867

  23. [24]

    Aman Madaan, Niket Tandon, Dheeraj Rajagopal, Peter Clark, Yiming Yang, and Eduard Hovy. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.508 Think about it! Improving defeasible reasoning by first modeling the question scenario. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6291--6310, Online and Punta ...

  24. [25]

    Shikib Mehri and Maxine Eskenazi. 2020. https://aclanthology.org/2020.sigdial-1.28 Unsupervised evaluation of interactive dialog with DialoGPT. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225--235, 1st virtual meeting. Association for Computational Linguistics

  25. [26]

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. https://arxiv.org/abs/2203.13474 CodeGen: An open large language model for code with multi-turn program synthesis. ArXiv preprint, abs/2203.13474

  26. [27]

    OpenAI. Model index for researchers. https://platform.openai.com/docs/model-index-for-researchers. Accessed: May 14, 2023

  27. [28]

    OpenAI. 2022. https://beta.openai.com/docs/model-index-for-researchers Model index for researchers . Blogpost

  28. [29]

    OpenAI. 2023. http://arxiv.org/abs/2303.08774 GPT-4 technical report

  29. [30]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. https://arxiv.org/abs/2203.02155 Training language models to follow instructions with human feedback

  30. [31]

    Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. http://arxiv.org/abs/2302.12813 Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

  31. [32]

    Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. https://doi.org/10.18653/v1/P18-1080 Style transfer through back-translation . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 866--876, Melbourne, Australia. Association for Computational Linguistics

  32. [33]

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2022. https://arxiv.org/abs/2210.03350 Measuring and narrowing the compositionality gap in language models . arXiv preprint arXiv:2210.03350

  33. [34]

    Ruchir Puri, David Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, and Frederick Reiss. 2021. https://arxiv.org/abs/2105.12655 CodeNet: A large-scale AI for code dataset for learning a diver...

  34. [35]

    Machel Reid and Graham Neubig. 2022. Learning to model editing processes. arXiv preprint arXiv:2205.12374

  35. [37]

    William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022 b . https://arxiv.org/abs/2206.05802 Self-critiquing models for assisting human evaluators . ArXiv:2206.05802

  36. [38]

    Jérémy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. 2022. https://arxiv.org/abs/2204.14146 Training language models with natural language feedback. ArXiv:2204.14146

  37. [39]

    Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, and Sebastian Riedel. 2022a. https://doi.org/10.48550/ARXIV.2208.11663 PEER: A collaborative language model

  38. [40]

    Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, and Sebastian Riedel. 2022b. PEER: A collaborative language model. ArXiv, abs/2208.11663

  39. [41]

    Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. http://arxiv.org/abs/2303.11366 Reflexion: an autonomous agent with dynamic memory and self-reflection

  40. [42]

    Herbert A. Simon. 1962. http://www.jstor.org/stable/985254 The architecture of complexity . Proceedings of the American Philosophical Society, 106(6):467--482

  41. [43]

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. https://proceedings.neurips.cc/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf Learning to summarize with human feedback . In Advances in Neural Information Processing Systems, volume 33, pages 3008--3021. C...

  42. [44]

    Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. 2023. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047

  43. [45]

    Niket Tandon, Aman Madaan, Peter Clark, Keisuke Sakaguchi, and Yiming Yang. 2021. Interscript: A dataset for interactive learning of scripts through error feedback. arXiv preprint arXiv:2112.07867

  44. [46]

    Niket Tandon, Aman Madaan, Peter Clark, and Yiming Yang. 2022. Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 339--352

  45. [47]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  46. [48]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. https://arxiv.org/abs/2201.11903 Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903

  47. [49]

    Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2022. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053

  48. [50]

    Kevin Yang, Nanyun Peng, Yuandong Tian, and Dan Klein. 2022. Re3: Generating longer stories with recursive reprompting and revision. In Conference on Empirical Methods in Natural Language Processing

  49. [51]

    Michihiro Yasunaga and Percy Liang. 2020. http://arxiv.org/abs/2005.10636 Graph-based, self-supervised program repair from diagnostic feedback. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), pages 10730--10739

  50. [52]

    Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28