hub Canonical reference

Large Language Models Can Self-Improve

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu · 2022 · cs.CL · arXiv 2210.11610

Canonical reference. 80% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 23 citing papers arXiv PDF

abstract

Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 method 1

citation-polarity summary

background 4 use method 1

representative citing papers

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.

Pioneer Agent: Continual Improvement of Small Language Models in Production

cs.AI · 2026-04-10 · unverdicted · novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

cs.LG · 2025-05-21 · unverdicted · novelty 6.0

Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

cs.CV · 2024-11-15 · unverdicted · novelty 6.0

LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated dataset and SWIRES test-time scaling.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

cs.AI · 2024-08-01 · conditional · novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

cs.CL · 2024-06-05 · conditional · novelty 6.0

OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

cs.CL · 2023-09-29 · conditional · novelty 6.0

ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

cs.CL · 2023-09-21 · conditional · novelty 6.0

Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

cs.CL · 2023-09-01 · conditional · novelty 6.0

RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

cs.CL · 2023-05-03 · conditional · novelty 6.0

Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.

ART: Automatic multi-step reasoning and tool-use for large language models

cs.CL · 2023-03-16 · unverdicted · novelty 6.0

ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion

cs.LG · 2026-04-21 · unverdicted · novelty 5.0

FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.

Kimi K2: Open Agentic Intelligence

cs.LG · 2025-07-28 · unverdicted · novelty 5.0

Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology

cs.SE · 2024-02-02 · unverdicted · novelty 4.0

CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.

The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences

cs.CL · 2025-09-14 · unverdicted · novelty 3.0 · 2 refs

The paper reduces a broad set of prompt engineering techniques to six core approaches and applies them to life sciences use cases while addressing common LLM pitfalls.

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

cs.HC · 2024-01-10 · unverdicted · novelty 3.0

This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Natural Language Processing in the Legal Domain

cs.CL · 2023-02-23 · unverdicted · novelty 3.0

A survey of nearly 1000 NLP & Law papers from 2013-2024 documenting increases in publication volume, scope, methodological sophistication, and data/code availability.

citing papers explorer

Showing 23 of 23 citing papers.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 22 · internal anchor
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 154 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media cs.CL · 2026-05-16 · unverdicted · none · ref 91 · internal anchor
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning cs.CL · 2026-04-19 · unverdicted · none · ref 54 · internal anchor
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
Meta-CoT: Enhancing Granularity and Generalization in Image Editing cs.CV · 2026-04-27 · unverdicted · none · ref 20 · internal anchor
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models cs.CL · 2026-04-23 · unverdicted · none · ref 3 · internal anchor
Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
Pioneer Agent: Continual Improvement of Small Language Models in Production cs.AI · 2026-04-10 · unverdicted · none · ref 40 · internal anchor
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning cs.LG · 2025-05-21 · unverdicted · none · ref 34 · internal anchor
Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step cs.CV · 2024-11-15 · unverdicted · none · ref 25 · internal anchor
LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated dataset and SWIRES test-time scaling.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 129 · internal anchor
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 157 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision cs.CL · 2024-06-05 · conditional · none · ref 8 · internal anchor
OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving cs.CL · 2023-09-29 · conditional · none · ref 16 · internal anchor
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models cs.CL · 2023-09-21 · conditional · none · ref 27 · internal anchor
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback cs.CL · 2023-09-01 · conditional · none · ref 122 · internal anchor
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes cs.CL · 2023-05-03 · conditional · none · ref 80 · internal anchor
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 135 · internal anchor
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion cs.LG · 2026-04-21 · unverdicted · none · ref 94 · internal anchor
FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.
Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 27 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology cs.SE · 2024-02-02 · unverdicted · none · ref 42 · internal anchor
CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.
The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences cs.CL · 2025-09-14 · unverdicted · none · ref 61 · 2 links · internal anchor
The paper reduces a broad set of prompt engineering techniques to six core approaches and applies them to life sciences use cases while addressing common LLM pitfalls.
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security cs.HC · 2024-01-10 · unverdicted · none · ref 243 · internal anchor
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
Natural Language Processing in the Legal Domain cs.CL · 2023-02-23 · unverdicted · none · ref 41 · internal anchor
A survey of nearly 1000 NLP & Law papers from 2013-2024 documenting increases in publication volume, scope, methodological sophistication, and data/code availability.

Large Language Models Can Self-Improve

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer