pith. the verified trust layer for science.

arxiv: 2210.11610 · v2 · pith:KENTDCMHnew · submitted 2022-10-20 · 💻 cs.CL

Large Language Models Can Self-Improve

Pith reviewed 2026-05-18 16:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords self-improvement · large language models · chain of thought · self-consistency · unlabeled data · reasoning · fine-tuning

The pith

Large language models can self-improve their reasoning using only unlabeled data by fine-tuning on their own high-confidence answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that large language models are capable of self-improving their reasoning abilities without any labeled examples. The method involves using chain-of-thought prompting to generate multiple rationales for unlabeled questions and then selecting the most consistent, high-confidence answers. These self-produced solutions serve as the training targets for fine-tuning the original model. Results show notable performance gains on several reasoning benchmarks, reaching levels comparable to state-of-the-art systems. Ablation experiments highlight the importance of focusing the fine-tuning on reasoning paths.

Core claim

An LLM can self-improve with only unlabeled datasets. A pre-trained LLM generates high-confidence rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency. The LLM is then fine-tuned using those self-generated solutions as target outputs. This approach improves the general reasoning ability of a 540B-parameter LLM, raising accuracy from 74.4% to 82.1% on GSM8K, from 78.2% to 83.0% on DROP, from 90.0% to 94.4% on OpenBookQA, and from 63.4% to 67.9% on ANLI-A3, and achieves state-of-the-art-level performance without any ground-truth labels.
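As a quick arithmetic check on the figures above, the sketch below computes the gain in percentage points for each benchmark, plus the share of the remaining error that was removed. The second quantity is a derived view that the paper does not report, and the function and variable names are illustrative.

```python
# Before/after accuracies (percent) as quoted in the abstract.
reported = {
    "GSM8K": (74.4, 82.1),
    "DROP": (78.2, 83.0),
    "OpenBookQA": (90.0, 94.4),
    "ANLI-A3": (63.4, 67.9),
}


def absolute_gain(before: float, after: float) -> float:
    """Gain in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


def relative_error_reduction(before: float, after: float) -> float:
    """Fraction of the remaining error (100 - before) removed by fine-tuning."""
    return round((after - before) / (100.0 - before), 3)


gains = {name: absolute_gain(b, a) for name, (b, a) in reported.items()}

for name, (b, a) in reported.items():
    reduction = relative_error_reduction(b, a)
    print(f"{name}: +{gains[name]} points, error reduced by {reduction:.1%}")
```

Read this way, the OpenBookQA gain is the smallest in points but the largest relative to the error that was left to remove.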

What carries the argument

Chain-of-Thought prompting combined with self-consistency produces high-confidence, rationale-augmented answers for unlabeled questions, and those answers become the fine-tuning targets.
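One way to picture this mechanism is the minimal sketch below. It assumes each sampled Chain-of-Thought path has already been parsed into a (rationale, answer) pair, and it uses a hypothetical agreement threshold; the paper's actual sample count, vote rule, and threshold are not given in the text reviewed here. Keeping every rationale that reaches the majority answer is likewise an assumption.

```python
from collections import Counter
from typing import List, Tuple

Path = Tuple[str, str]  # (rationale text, final answer); hypothetical layout


def majority_answer(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of samples agreeing with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def select_targets(question: str, paths: List[Path], threshold: float) -> List[dict]:
    """Build fine-tuning examples from one unlabeled question.

    Returns an empty list when agreement falls below `threshold`
    (an assumed filter, not a value taken from the paper).
    """
    answer, confidence = majority_answer([a for _, a in paths])
    if confidence < threshold:
        return []
    # Assumption: keep every rationale whose final answer matches the vote.
    return [
        {"input": question, "target": f"{rationale} The answer is {a}."}
        for rationale, a in paths
        if a == answer
    ]


# Toy, hand-written samples for illustration only.
question = "What is twice the sum of 3 and 4?"
paths = [
    ("3 + 4 = 7, and 7 * 2 = 14.", "14"),
    ("Twice the sum of 3 and 4 is 14.", "14"),
    ("3 + 4 * 2 = 11.", "11"),
    ("(3 + 4) * 2 = 14.", "14"),
]
examples = select_targets(question, paths, threshold=0.7)
print(len(examples))
```

No ground-truth label appears anywhere in this step; the only quality signal is agreement among the model's own samples.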

If this is right

  • The model's accuracy on mathematical reasoning improves (GSM8K: 74.4% to 82.1%).
  • Reading comprehension (DROP: 78.2% to 83.0%) and natural language inference (ANLI-A3: 63.4% to 67.9%) improve without external labels.
  • Fine-tuning on reasoning chains proves more effective than other self-supervision methods.
  • State-of-the-art results are attainable solely through internal model outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This could enable repeated cycles of self-improvement where the updated model generates even better targets in subsequent rounds.
  • It could reduce dependence on human-annotated data for advancing model capabilities in reasoning.
  • The approach might extend to other domains where consistency in outputs can serve as a quality signal.

Load-bearing premise

Self-consistency in chain-of-thought generations reliably indicates high-quality answers that, when used as fine-tuning targets, will lead to actual improvements in the model's reasoning.

What would settle it

If fine-tuning the model on these self-generated high-confidence answers results in no improvement or worse performance on the reasoning benchmarks, that would show the self-improvement claim does not hold.

Original abstract

Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that large language models can self-improve their reasoning abilities using only unlabeled datasets. A pre-trained LLM generates 'high-confidence' rationale-augmented answers for unlabeled questions via Chain-of-Thought prompting and self-consistency; these self-generated solutions are then used as fine-tuning targets. Reported gains include GSM8K (74.4% to 82.1%), DROP (78.2% to 83.0%), OpenBookQA (90.0% to 94.4%), and ANLI-A3 (63.4% to 67.9%) on a 540B-parameter model, reaching state-of-the-art levels without ground-truth labels. Ablation studies indicate that fine-tuning on reasoning is critical.

Significance. If the results hold, this would be a significant contribution to NLP and reasoning research. It shows a scalable way for LLMs to bootstrap reasoning improvements from unlabeled data alone, reducing dependence on human-annotated supervision. The scale of the model and the magnitude of gains on multiple benchmarks make the finding notable, and the ablation studies help isolate the role of reasoning targets.

major comments (3)
  1. [Method] The method description provides no details on the self-consistency threshold, number of CoT samples drawn, or exact majority-vote rule used to select 'high-confidence' examples. These free parameters directly determine the quality and size of the fine-tuning set and are load-bearing for reproducing the claimed self-improvement.
  2. [Results] The results section reports benchmark gains after fine-tuning but contains no direct measurement of ground-truth accuracy on the filtered high-confidence subset (even on validation splits where labels are available). Without this, it is impossible to confirm that the selected targets are higher-quality than the base model's typical output rather than merely more numerous or format-matched.
  3. [Ablations] The ablation studies do not control for distribution shift between the self-generated training examples and the held-out test benchmarks. This omission weakens the attribution of gains specifically to self-improvement rather than to exposure to a new data distribution.
minor comments (2)
  1. [Abstract] The abstract claims 'state-of-the-art-level performance' without citing the specific prior SOTA numbers or models for each benchmark, making the comparison hard to evaluate.
  2. [Method] Notation for the self-consistency procedure could be clarified with a short pseudocode or explicit formula for the majority vote.
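The formula requested in minor comment 2 could be written as follows. The symbols m (number of sampled paths), a_i (final answer of path i), c (agreement rate), and τ (confidence threshold) are notation assumed here, not taken from the paper.

```latex
\hat{a} = \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}[a_i = a],
\qquad
c = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[a_i = \hat{a}],
\qquad
\text{keep the question if } c \ge \tau .
```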

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of our work. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Method] The method description provides no details on the self-consistency threshold, number of CoT samples drawn, or exact majority-vote rule used to select 'high-confidence' examples. These free parameters directly determine the quality and size of the fine-tuning set and are load-bearing for reproducing the claimed self-improvement.

    Authors: We agree that these hyperparameters are essential for reproducibility. The manuscript describes the overall procedure using Chain-of-Thought prompting and self-consistency to filter high-confidence examples but does not enumerate the precise values. We will revise the Method section to specify the number of CoT samples generated per question, the exact majority-vote aggregation rule, and the confidence threshold applied for selection.
    revision: yes

  2. Referee: [Results] The results section reports benchmark gains after fine-tuning but contains no direct measurement of ground-truth accuracy on the filtered high-confidence subset (even on validation splits where labels are available). Without this, it is impossible to confirm that the selected targets are higher-quality than the base model's typical output rather than merely more numerous or format-matched.

    Authors: This criticism is fair. Although the overall benchmark improvements are reported, we did not include a direct accuracy evaluation of the filtered high-confidence subset against available ground-truth labels on validation splits. We will add this analysis in the revised Results section, reporting the ground-truth accuracy of the selected self-generated solutions on held-out validation data to demonstrate that the filtering step improves target quality.
    revision: yes

  3. Referee: [Ablations] The ablation studies do not control for distribution shift between the self-generated training examples and the held-out test benchmarks. This omission weakens the attribution of gains specifically to self-improvement rather than to exposure to a new data distribution.

    Authors: We acknowledge the value of explicitly ruling out distribution shift. Our existing ablations already show that fine-tuning on reasoning-augmented targets is critical while other formats yield smaller gains, which helps isolate the contribution beyond mere data exposure. To further address this, we will expand the ablation studies in the revision with an additional control that fine-tunes on self-generated non-reasoning outputs from the same unlabeled pool, thereby better attributing improvements to the reasoning content rather than distributional differences.
    revision: yes
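The check promised in response 2 might look like the sketch below. It assumes a validation split with gold answers and a plain majority vote over sampled answers; the data, the threshold values, and all function names are made up for illustration and are not results from the paper.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def vote(answers: List[str]) -> Tuple[str, float]:
    """Majority answer and the fraction of samples that agree with it."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def subset_accuracy(
    samples: Dict[str, List[str]],
    gold: Dict[str, str],
    threshold: float,
) -> Tuple[int, Optional[float]]:
    """Count questions kept at `threshold` and score their voted answers against gold.

    Accuracy is None when the filter keeps nothing.
    """
    kept = 0
    correct = 0
    for qid, answers in samples.items():
        answer, confidence = vote(answers)
        if confidence >= threshold:
            kept += 1
            correct += int(answer == gold[qid])
    return kept, (correct / kept if kept else None)


# Hypothetical validation data: four questions, four sampled answers each.
samples = {
    "q1": ["4", "4", "4", "4"],
    "q2": ["7", "7", "9", "7"],
    "q3": ["1", "2", "3", "2"],
    "q4": ["5", "6", "5", "9"],
}
gold = {"q1": "4", "q2": "7", "q3": "3", "q4": "6"}

print(subset_accuracy(samples, gold, threshold=0.0))   # no filtering
print(subset_accuracy(samples, gold, threshold=0.75))  # high-confidence subset
```

Comparing the unfiltered and filtered rows is what would show whether the selected targets are better than the model's typical output or merely fewer.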

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents an empirical method: generate rationale-augmented answers on unlabeled data via CoT prompting plus self-consistency, filter high-confidence examples, and fine-tune the base LLM on them. Gains are reported on separate held-out benchmarks (GSM8K, DROP, OpenBookQA, ANLI-A3) rather than on the self-generated training targets themselves. No equation or procedure reduces by construction to a fitted parameter or self-generated label; the validation remains external. No load-bearing self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are described in the provided text. The approach is therefore self-contained against standard external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that consistency across sampled reasoning paths correlates with answer quality and that fine-tuning on such data transfers to held-out test distributions.

free parameters (1)
  • self-consistency threshold
    Number of sampled paths and agreement level used to label an answer as high-confidence; value not stated in abstract but required to select training targets.
axioms (1)
  • domain assumption: Self-consistency across multiple CoT generations indicates a higher likelihood of correctness
    Invoked when selecting 'high-confidence' rationale-augmented answers for fine-tuning targets.

pith-pipeline@v0.9.0 · 5733 in / 1270 out tokens · 48099 ms · 2026-05-18T16:54:53.257432+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  2. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  3. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  4. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  5. Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.

  6. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  7. The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

    cs.LG 2025-05 unverdicted novelty 6.0

    Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.

  8. LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    cs.CV 2024-11 unverdicted novelty 6.0

    LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated data...

  9. Training Language Models to Self-Correct via Reinforcement Learning

    cs.LG 2024-09 unverdicted novelty 6.0

    SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

  10. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

    cs.AI 2024-08 conditional novelty 6.0

    Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

  11. Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    cs.CL 2024-06 conditional novelty 6.0

    OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

  12. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  13. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

    cs.CL 2023-09 conditional novelty 6.0

    RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.

  14. ART: Automatic multi-step reasoning and tool-use for large language models

    cs.CL 2023-03 unverdicted novelty 6.0

    ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

  15. FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion

    cs.LG 2026-04 unverdicted novelty 5.0

    FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.

  16. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  17. The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences

    cs.CL 2025-09 unverdicted novelty 3.0

    The paper reduces a broad set of prompt engineering techniques to six core approaches and applies them to life sciences use cases while addressing common LLM pitfalls.

  18. Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

    cs.HC 2024-01 unverdicted novelty 3.0

    This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
