arxiv: 2409.12917 · v2 · pith:OQMAWUXNnew · submitted 2024-09-19 · 💻 cs.LG

Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust

Pith reviewed 2026-05-17 11:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords: self-correction, reinforcement learning, large language models, online RL, MATH, HumanEval, Gemini

The pith

Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops SCoRe, a multi-turn online RL method that improves an LLM's ability to correct its own mistakes without multiple models or extra supervision. It first shows that supervised fine-tuning on offline model-generated correction traces usually fails due to distribution mismatch between collected errors and the model's own responses or due to collapse into ineffective correction modes. SCoRe instead trains under the model's current distribution of self-generated traces and adds regularization—an initial multi-turn RL phase plus a reward bonus—to steer learning toward correction behavior that works on new test problems. On Gemini 1.0 Pro and 1.5 Flash, this yields clear gains in self-correction accuracy on MATH and HumanEval benchmarks.

Core claim

SCoRe performs multi-turn online reinforcement learning directly on traces the model generates itself, starting with a policy-initialization phase of RL followed by a reward bonus that amplifies effective self-correction, thereby avoiding the distribution mismatch and behavior collapse that limit supervised fine-tuning on offline correction data.

What carries the argument

SCoRe, the multi-turn online RL procedure that trains under the model's own distribution of correction traces and uses phased initialization plus reward regularization to produce generalizable self-correction behavior.
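
The reward bonus can be pictured with a minimal Python sketch. This page does not give the bonus formula, so the form used here, a term proportional to the change in correctness between the first and second attempt, is an assumption for illustration. The names `shaped_second_turn_reward` and `alpha` and the 0/1 correctness rewards are hypothetical and not taken from the paper.

```python
def shaped_second_turn_reward(r1: float, r2: float, alpha: float = 1.0) -> float:
    """Return the second-attempt reward plus a progress bonus.

    r1, r2: correctness rewards (assumed 0.0 or 1.0) for the first and
            second attempts on the same problem.
    alpha:  hypothetical bonus scale; the default is illustrative only.
    """
    # Assumed form: positive when a wrong answer is fixed,
    # negative when a correct answer is broken, zero otherwise.
    bonus = alpha * (r2 - r1)
    return r2 + bonus


# Enumerate the four possible two-turn outcomes under 0/1 rewards.
for r1, r2 in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    shaped = shaped_second_turn_reward(r1, r2)
    print(f"first={r1:.0f} second={r2:.0f} shaped={shaped:+.1f}")
```

Under this assumed form, repairing a wrong first attempt scores higher than simply keeping a correct one, and breaking a correct first attempt is penalised. That is one way a bonus could discourage a policy from just repeating its first answer.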

If this is right

With Gemini 1.0 Pro and 1.5 Flash, SCoRe improves the base models' self-correction by 15.6% on MATH and 9.1% on HumanEval.
The method achieves state-of-the-art self-correction results using only self-generated data and no external models or supervision.
Variants of supervised fine-tuning on model-generated traces are shown to be insufficient for instilling reliable self-correction.
Regularization via an initial multi-turn RL phase and a reward bonus steers learning away from collapse into ineffective modes.
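
The gains listed above depend on how self-correction is scored, which this summary does not spell out. A minimal sketch of one plausible scoring scheme follows, assuming each problem gets exactly two attempts and a boolean correctness judgment per attempt. The function name, dictionary keys, and toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def self_correction_metrics(outcomes: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarise two-attempt results.

    outcomes: one (first_correct, second_correct) pair per problem.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one problem")
    acc_t1 = sum(first for first, _ in outcomes) / n
    acc_t2 = sum(second for _, second in outcomes) / n
    fixed = sum((not first) and second for first, second in outcomes) / n
    broken = sum(first and (not second) for first, second in outcomes) / n
    return {
        "accuracy_t1": acc_t1,
        "accuracy_t2": acc_t2,
        "delta": acc_t2 - acc_t1,        # net gain from the second attempt
        "incorrect_to_correct": fixed,   # fraction of problems repaired
        "correct_to_incorrect": broken,  # fraction of problems damaged
    }


# Toy data (made up): four problems, one repaired, none damaged.
toy = [(True, True), (False, True), (False, False), (True, True)]
print(self_correction_metrics(toy))
```

Reporting the repaired and damaged fractions alongside the net delta separates a model that actually fixes errors from one that changes answers at random, since the delta equals repaired minus damaged.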

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-generated-data RL pattern could be tested on other iterative tasks such as multi-step reasoning or code debugging.
If the regularization prevents collapse here, similar bonuses might stabilize other multi-turn RL applications in language models.
Applying SCoRe to open-source base models would test whether the reported gains require proprietary model families.

Load-bearing premise

Training under the model's own distribution of self-generated correction traces plus the described regularization will produce correction behavior that works on unseen test problems instead of high-reward but non-generalizable patterns.

What would settle it

Train with SCoRe, then measure self-correction accuracy on a held-out set of MATH problems; if the accuracy gain over the base model disappears or reverses, the central claim does not hold.

Original abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SCoRe advances self-correction training for LLMs through online RL with regularization to avoid SFT pitfalls, though more experimental rigor would help confirm the results.

The punchline is that SCoRe shows online RL can train effective self-correction in LLMs from self-generated traces alone, delivering 15.6% and 9.1% gains on MATH and HumanEval for Gemini 1.0 Pro and 1.5 Flash. What is new is the multi-turn online RL approach with a dedicated initialization phase and reward bonus designed to counter the distribution mismatch and behavior collapse that the authors identify in SFT on offline correction data. They make a reasonable case that training under the model's current policy distribution plus regularization steers learning toward test-time effective behavior instead of just high-reward responses.

The paper does well in spelling out the failure modes of SFT and in proposing a practical fix that stays within self-generated data. This has direct relevance for making LLMs more dependable in domains like math and code without needing external supervision or stronger models.

The soft spots are mostly around the experimental reporting. The abstract mentions quantitative improvements but does not detail controls, statistical significance, or precise measurement of self-correction success, which leaves some room for doubt on how general the gains are. The stress-test point about possible collapse to benchmark-specific modes is a fair one to examine; if the full results show the regularization works as intended across varied error types, that would address it. Minor point: the method still depends on benchmark-derived rewards, so claims of full self-supervision need careful wording.

This paper is for researchers in LLM post-training and reinforcement learning who care about self-improvement capabilities. Anyone working on reliable deployment of models for error-sensitive tasks would find the ideas useful. It deserves a serious referee because the central argument holds up on the evidence presented and the method is reproducible in principle with the described setup. I recommend putting it through peer review rather than desk rejecting it.

Referee Report

3 major / 1 minor

Summary. The paper introduces SCoRe, a multi-turn online reinforcement learning method for training LLMs to self-correct using only self-generated data. It first shows that supervised fine-tuning (SFT) on offline model-generated correction traces often fails due to distribution mismatch between the data-collection policy and the model's responses or due to behavior collapse into ineffective correction modes. SCoRe mitigates this via online RL under the model's own trace distribution, combined with regularization consisting of an initial multi-turn RL phase to initialize a more stable policy followed by a reward bonus to encourage effective self-correction. Experiments on Gemini 1.0 Pro and 1.5 Flash report state-of-the-art self-correction, with absolute improvements of 15.6% and 9.1% on MATH and HumanEval respectively.

Significance. If the empirical gains prove robust under proper controls and generalize beyond the training distribution, the work would be significant for providing a scalable, supervision-light approach to instill self-correction in LLMs. The explicit treatment of distribution mismatch and collapse via online training plus targeted regularization (policy initialization + reward bonus) offers a concrete methodological advance over prior SFT-based attempts, with potential applicability to reasoning and code-generation tasks.

major comments (3)

[Abstract] The central quantitative claim of 15.6% and 9.1% improvements is presented without any description of how self-correction success was measured (e.g., exact pass/fail criteria on MATH problems or HumanEval test cases), the precise baselines used for comparison, statistical significance tests, or variance across runs. This detail is load-bearing for the claim that SCoRe achieves effective test-time self-correction rather than benchmark-specific gains.
[Experimental results section] The regularization components (initial multi-turn RL for policy initialization and reward bonus) are described as addressing behavior collapse, yet no ablation is reported that isolates the contribution of each component or tests whether the learned policy generalizes to novel error types or out-of-distribution problems. Without such controls, the reported gains could reflect fitting to high-reward traces seen during training rather than acquisition of a general correction capability.
[Method section] The reward bonus is listed among the free parameters, but the text provides no sensitivity analysis, default value, or justification for its scale. This leaves open whether the reported improvements depend on careful tuning of this hyperparameter or hold across reasonable choices.
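
The first comment asks for a measure of statistical uncertainty. A minimal sketch of one option is a paired percentile bootstrap over test problems, shown below. It assumes per-problem correctness is available for both the base and the tuned model on the same problems. The function name, defaults, and toy data are hypothetical, and nothing here reproduces an analysis from the paper. It captures sampling noise over problems only, not variance across training seeds.

```python
import random
from typing import List, Tuple


def bootstrap_gain_ci(
    base_correct: List[bool],
    tuned_correct: List[bool],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point_gain, ci_low, ci_high) for tuned minus base accuracy.

    Both lists hold per-problem correctness on the same problems, in the
    same order, so problems are resampled as pairs.
    """
    if len(base_correct) != len(tuned_correct) or not base_correct:
        raise ValueError("need two non-empty lists of equal length")
    n = len(base_correct)
    # Per-problem paired difference: +1 gained, 0 unchanged, -1 lost.
    diffs = [int(t) - int(b) for b, t in zip(base_correct, tuned_correct)]
    point = sum(diffs) / n
    rng = random.Random(seed)
    # Resample problems with replacement and recompute the mean gain.
    gains = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 + level) / 2 * n_resamples))
    return point, gains[lo_idx], gains[hi_idx]


# Toy data (made up): 20 problems, the tuned model fixes 4 and breaks 1.
base = [True] * 10 + [False] * 10
tuned = [True] * 9 + [False] + [True] * 4 + [False] * 6
print(bootstrap_gain_ci(base, tuned))
```

If the resulting interval excludes zero, the gain is unlikely to be an artifact of which test problems happened to be sampled. Seed-to-seed variance would still need separate training runs.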

minor comments (1)

[Abstract and Method] The abstract and method descriptions use the term 'self-correction behavior' without a precise operational definition (e.g., whether it requires the model to detect and fix its own errors in a single additional turn or over multiple turns). Adding a short formal definition would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We appreciate the opportunity to address the concerns regarding clarity in the abstract, the need for additional controls in the experiments, and hyperparameter details. We believe these points can be resolved through targeted revisions and clarifications without altering the core contributions of the work.

Referee: [Abstract] The central quantitative claim of 15.6% and 9.1% improvements is presented without any description of how self-correction success was measured (e.g., exact pass/fail criteria on MATH problems or HumanEval test cases), the precise baselines used for comparison, statistical significance tests, or variance across runs. This detail is load-bearing for the claim that SCoRe achieves effective test-time self-correction rather than benchmark-specific gains.

Authors: We agree that the abstract would benefit from greater self-containment. In the revised manuscript we will expand the abstract to briefly specify the success metrics (exact answer match after correction attempts on MATH; standard pass@1 on HumanEval test cases), the primary baselines (base model and SFT variants), and that gains are measured under the model's own self-correction loop at test time. Detailed protocols appear in the experimental setup; we will add a short clause noting that results reflect consistent trends across the reported model scales. Formal statistical significance testing and multi-seed variance were not computed in the original experiments owing to the scale of the RL runs, but we will note this limitation explicitly.
revision: yes

Referee: [Experimental results section] The regularization components (initial multi-turn RL for policy initialization and reward bonus) are described as addressing behavior collapse, yet no ablation is reported that isolates the contribution of each component or tests whether the learned policy generalizes to novel error types or out-of-distribution problems. Without such controls, the reported gains could reflect fitting to high-reward traces seen during training rather than acquisition of a general correction capability.

Authors: We acknowledge the value of isolating the regularization components. The current results already contrast full SCoRe against SFT baselines, which indirectly highlights the benefit of online training under the model's own distribution. In revision we will add explicit ablations that remove the policy-initialization phase and the reward bonus in turn, reporting the resulting performance drop. On generalization, evaluations use held-out test splits containing diverse error patterns; however, dedicated probes for entirely novel error types or strong distribution shifts were not performed. We will add a discussion of this scope limitation and list targeted generalization experiments as future work.
revision: partial

Referee: [Method section] The reward bonus is listed among the free parameters, but the text provides no sensitivity analysis, default value, or justification for its scale. This leaves open whether the reported improvements depend on careful tuning of this hyperparameter or hold across reasonable choices.

Authors: We will revise the method section to state the default bonus value employed in the main experiments and the preliminary tuning procedure used to select it (balancing correction encouragement against the primary outcome reward). We will also include a short sensitivity table or plot showing performance for a modest range of bonus magnitudes around the chosen value, confirming that gains remain stable within that range.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in SCoRe's empirical RL derivation

The paper's core contribution is an empirical multi-turn online RL procedure (SCoRe) that trains on the model's own self-generated correction traces, initialized via a preliminary RL phase and regularized with a reward bonus. Reported gains (15.6% on MATH, 9.1% on HumanEval) are measured outcomes on external held-out benchmarks after training, not quantities defined by or equivalent to the training objective itself. SFT failure modes are diagnosed via direct experimental observation of distribution mismatch and collapse, not by definitional reduction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central result; the method remains falsifiable against independent test distributions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions about reward signals guiding policy improvement and the premise that self-generated traces can serve as effective training data when properly regularized; no new entities are postulated.

free parameters (1)

reward bonus scale
Used to amplify self-correction; specific value not reported in abstract.

axioms (1)

domain assumption: Multi-turn interactions can be modeled as a Markov decision process for LLM correction
Invoked implicitly when framing self-correction as multi-turn RL.
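
One way to make this assumption concrete is the two-turn formulation below. The notation is illustrative and not taken from the paper: $x$ is a problem, $y_1$ and $y_2$ are the first and second attempts, $p$ is a fixed instruction asking the model to revise, $y^*$ is the reference answer, and $r$ is a 0/1 correctness reward. Summing the rewards of both attempts in the objective is likewise an assumption.

```latex
\begin{aligned}
s_1 &= x, & y_1 &\sim \pi_\theta(\cdot \mid s_1), \\
s_2 &= (x, y_1, p), & y_2 &\sim \pi_\theta(\cdot \mid s_2), \\
J(\theta) &= \mathbb{E}_{x,\, y_1,\, y_2}\big[\, r(y_1, y^*) + r(y_2, y^*) \,\big].
\end{aligned}
```

Because $s_2$ contains the full history $(x, y_1, p)$, the next-state distribution depends only on the current state and action, which is what the Markov property requires.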

pith-pipeline@v0.9.0 · 5678 in / 1191 out tokens · 79062 ms · 2026-05-17T11:57:35.333634+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence defect_zero_iff_one echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt.
IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance unclear

This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction.
IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
cs.CV 2025-03 unverdicted novelty 7.0

Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
cs.CL 2024-12 unverdicted novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
cs.CL 2026-05 unverdicted novelty 6.0

CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
cs.LG 2026-05 unverdicted novelty 6.0

LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
cs.AI 2026-05 unverdicted novelty 6.0

Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
PaT: Planning-after-Trial for Efficient Test-Time Code Generation
cs.CL 2026-05 unverdicted novelty 6.0

PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
cs.SE 2026-05 unverdicted novelty 6.0

RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
cs.AI 2026-04 unverdicted novelty 6.0

A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
cs.CV 2025-11 unverdicted novelty 6.0

REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
cs.CV 2025-10 conditional novelty 6.0

Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
cs.CL 2024-12 unverdicted novelty 6.0

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 5.0

CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
cs.AI 2025-07 accept novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
cs.AI 2025-01 unverdicted novelty 3.0

The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

Reference graph

Works this paper leans on

282 extracted references · 282 canonical work pages · cited by 17 Pith papers · 64 internal anchors

[1]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

A. Ahmadian, C. Cremer, M. Gall \'e , M. Fadaee, J. Kreutzer, A. \"U st \"u n, and S. Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

u rek, E. Aky \

A. F. Aky \"u rek, E. Aky \"u rek, A. Madaan, A. Kalyan, P. Clark, D. Wijaya, and N. Tandon. Rl4f: Generating natural language feedback with reinforcement learning for repairing model outputs. arXiv preprint arXiv:2305.08844, 2023

work page arXiv 2023
[3]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

X. Chen, M. Lin, N. Sch \"a rli, and D. Zhou. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Teaching large language models to reason with reinforcement learning

A. Havrilla, Y. Du, S. C. Raparthy, C. Nalmpantis, J. Dwivedi-Yu, M. Zhuravinskyi, E. Hambro, S. Sukhbaatar, and R. Raileanu. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642, 2024 a

work page arXiv 2024
[9]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

work page 2021
[10]

J. Hong, N. Lee, and J. Thorne. Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Large Language Models Cannot Self-Correct Reasoning Yet

J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and D. Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

G. Kim, P. Baldi, and S. McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023

work page internal anchor Pith review arXiv 2023
[17]

StarCoder 2 and The Stack v2: The Next Generation

A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

A. Ni, M. Allamanis, A. Cohan, Y. Deng, K. Shi, C. Sutton, and P. Yin. Next: Teaching large language models to reason about code execution. arXiv preprint arXiv:2404.14662, 2024

work page arXiv 2024
[20]

T. X. Olausson, J. P. Inala, C. Wang, J. Gao, and A. Solar-Lezama. Is self-repair a silver bullet for code generation? In The Twelfth International Conference on Learning Representations, 2023

work page 2023
[21]

L. Pan, M. Saxon, W. Xu, D. Nathani, X. Wang, and W. Y. Wang. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188, 2023

work page arXiv 2023
[22]

D. Paul, M. Ismayilzada, M. Peyrard, B. Borges, A. Bosselut, R. West, and B. Faltings. Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904, 2023

work page arXiv 2023
[23]

Y. Qu, T. Zhang, N. Garg, and A. Kumar. Recursive introspection: Teaching language model agents how to self-improve. arXiv preprint arXiv:2407.18219, 2024

work page arXiv 2024
[27]

Reflexion: Language Agents with Verbal Reinforcement Learning

N. Shinn, B. Labash, and A. Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Singh, J

A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, X. Garcia, P. J. Liu, J. Harrison, J. Lee, K. Xu, A. Parisi, A. Kumar, A. Alemi, A. Rizkowsky, A. Nova, B. Adlam, B. Bohnet, G. Elsayed, H. Sedghi, I. Mordatch, I. Simpson, I. Gur, J. Snoek, J. Pennington, J. Hron, K. Kenealy, K. Swersky, K. Mahajan, L. Culp, L. Xiao, M. L. Bileschi, N. Constant, R...

work page 2024
[31]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

C. Team. Codegemma: Open code models based on gemma. arXiv preprint arXiv:2406.11409, 2024

work page arXiv 2024
[33]

G. Tyen, H. Mansoor, V. C a rbune, Y. P. Chen, and T. Mak. Llms cannot find reasoning errors, but can correct them given the error location. In Findings of the Association for Computational Linguistics ACL 2024, pages 13894--13908, 2024

work page 2024
[34]

Solving math word problems with process- and outcome-based feedback

J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Welleck, X

S. Welleck, X. Lu, P. West, F. Brahman, T. Shen, D. Khashabi, and Y. Choi. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=hH36JeQZDaO

work page 2023
[39]

S. Ye, Y. Jo, D. Kim, S. Kim, H. Hwang, and M. Seo. Selfee: Iterative self-revising llm empowered by self-feedback generation. Blog post, 2023

work page 2023
[40]

T. Ye, Z. Xu, Y. Li, and Z. Allen-Zhu. Physics of language models: Part 2.2, how to learn from mistakes on grade-school math problems, 2024. URL https://arxiv.org/abs/2408.16293

work page arXiv 2024
[42]

Zelikman, Y

E. Zelikman, Y. Wu, J. Mu, and N. Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35: 0 15476--15488, 2022

work page 2022
[45]

Y. Zhou, A. Zanette, J. Pan, S. Levine, and A. Kumar. Archer: Training language model agents via hierarchical multi-turn rl. arXiv preprint arXiv:2402.19446, 2024

work page arXiv 2024
[46]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[47]

Publications Manual , year = "1983", publisher =

work page 1983
[48]

Do large language models latently perform multi-hop reasoning?arXiv preprint arXiv:2402.16837,

Do Large Language Models Latently Perform Multi-Hop Reasoning? , author=. arXiv preprint arXiv:2402.16837 , year=

work page arXiv
[49]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Multi-turn Reinforcement Learning from Preference Human Feedback. arXiv preprint arXiv:2405.14655.

[52]

Building Math Agents with Multi-Turn Iterative Preference Learning. arXiv preprint arXiv:2409.02392.

[53]

Stop regressing: Training value functions via classification for scalable deep RL. arXiv preprint arXiv:2403.03950.

[54]

Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.

[55]

Small Language Models Need Strong Verifiers to Self-Correct Reasoning. arXiv preprint arXiv:2404.17140.

[56]

Causal confusion in imitation learning. Advances in Neural Information Processing Systems.

[57]

Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.

[58]

SelFee: Iterative self-revising LLM empowered by self-feedback generation. Blog post.

[59]

NATURAL PLAN: Benchmarking LLMs on Natural Language Planning. arXiv preprint arXiv:2406.04520.

[60]

LLMs cannot find reasoning errors, but can correct them given the error location. Findings of the Association for Computational Linguistics: ACL 2024.

[61]

When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs. arXiv preprint arXiv:2406.01297.

[62]

Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

[63]

Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243

[64]

Galen Andrew and Jianfeng Gao. Scalable training of

[65]

Dan Gusfield. 1997.

[66]

Mohammad Sadegh Rasooli and Joel R. Tetreault. Computing Research Repository, 2015.

[67]

Rie Kubota Ando and Tong Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, December.

[68]

Learning to summarize with human feedback. Advances in Neural Information Processing Systems.

[69]

Constitutional AI: Harmlessness from AI Feedback. 2022.

[70]

Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.

[71]

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. CoRR, 2018. arXiv:1804.04235.

[72]

Volodymyr Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. 2016. arXiv:1602.01783.

[73]

PaLM 2 Technical Report. 2023.

[74]

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical Neural Story Generation. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1082

[75]

LaMDA: Language Models for Dialog Applications. arXiv preprint arXiv:2201.08239.

[76]

PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311.

[77]

Language models are few-shot learners. Advances in Neural Information Processing Systems.

[78]

Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.

[79]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.

[80]

WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

[81]

Learning to extract coherent summary via deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence.

[82]

Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.

[83]

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2307.16039.

[84]

Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894.

[85]

A Study of Reinforcement Learning for Neural Machine Translation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.

[86]

Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.

[87]

GPT-4 Technical Report. 2023.

[88]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.

[89]

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning. arXiv preprint arXiv:2404.05868.

[90]

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences. arXiv preprint arXiv:2404.03715.

[91]

An overview of Bard: an early experiment with generative AI. 2023.

[92]

Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.

[93]

OpenAI Pricing. 2023.

[94]

AI Platform Data Labeling Service pricing. 2023.

[95]

Want To Reduce Labeling Cost? GPT-3 Can Help. Findings of the Association for Computational Linguistics: EMNLP 2021.

[96]

ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.

[97]

Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Boyang Li, Shafiq Joty, and Lidong Bing. Is GPT-3 a Good Data Annotator? Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.626

[98]

RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment. 2023.

