pith. machine review for the scientific record.

arxiv: 2409.12917 · v2 · pith:OQMAWUXNnew · submitted 2024-09-19 · 💻 cs.LG

Training Language Models to Self-Correct via Reinforcement Learning

Pith reviewed 2026-05-17 11:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords self-correction · reinforcement learning · large language models · online RL · MATH · HumanEval · Gemini

The pith

Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops SCoRe, a multi-turn online RL method that improves an LLM's ability to correct its own mistakes without multiple models or extra supervision. It first shows that supervised fine-tuning on offline model-generated correction traces usually fails due to distribution mismatch between collected errors and the model's own responses or due to collapse into ineffective correction modes. SCoRe instead trains under the model's current distribution of self-generated traces and adds regularization—an initial multi-turn RL phase plus a reward bonus—to steer learning toward correction behavior that works on new test problems. On Gemini 1.0 Pro and 1.5 Flash, this yields clear gains in self-correction accuracy on MATH and HumanEval benchmarks.

Core claim

SCoRe performs multi-turn online reinforcement learning directly on traces the model generates itself, starting with a policy-initialization phase of RL followed by a reward bonus that amplifies effective self-correction, thereby avoiding the distribution mismatch and behavior collapse that limit supervised fine-tuning on offline correction data.

What carries the argument

SCoRe, the multi-turn online RL procedure that trains under the model's own distribution of correction traces and uses phased initialization plus reward regularization to produce generalizable self-correction behavior.
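
As a rough illustration of how a reward bonus can favor genuine correction over simply repeating the first answer, the sketch below shapes the second-attempt reward of a two-attempt trace. The function name `shaped_second_turn_reward`, the coefficient `alpha`, and the specific bonus form (a multiple of the change in correctness between attempts) are assumptions made for illustration; the summary above does not state the exact form or scale of the bonus.

```python
def shaped_second_turn_reward(r1: float, r2: float, alpha: float = 1.0) -> float:
    """Return the second-attempt reward plus a progress bonus.

    r1, r2: correctness rewards (0.0 or 1.0) of the first and second attempt.
    alpha:  hypothetical bonus scale; not a value taken from the paper.
    """
    # Positive when a wrong answer is fixed, negative when a right answer is broken,
    # zero when the correctness of the two attempts is the same.
    bonus = alpha * (r2 - r1)
    return r2 + bonus


# Enumerate the four possible two-attempt outcomes.
for r1, r2 in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print(r1, r2, shaped_second_turn_reward(r1, r2, alpha=2.0))
```

Under this shaping, a trace that fixes a wrong first answer scores higher than one that was correct both times, and a trace that breaks a correct answer is penalized. That is one plausible way to discourage a policy from collapsing into never changing its first answer.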

If this is right

  • SCoRe raises self-correction rates by 15.6% on MATH with Gemini 1.0 Pro and by 9.1% on HumanEval with Gemini 1.5 Flash.
  • The method achieves state-of-the-art self-correction results using only self-generated data and no external models or supervision.
  • Variants of supervised fine-tuning on model-generated traces are shown to be insufficient for instilling reliable self-correction.
  • Regularization via an initial multi-turn RL phase and a reward bonus steers learning away from collapse into ineffective modes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-generated-data RL pattern could be tested on other iterative tasks such as multi-step reasoning or code debugging.
  • If the regularization prevents collapse here, similar bonuses might stabilize other multi-turn RL applications in language models.
  • Applying SCoRe to open-source base models would test whether the reported gains require proprietary model families.

Load-bearing premise

Training under the model's own distribution of self-generated correction traces plus the described regularization will produce correction behavior that works on unseen test problems instead of high-reward but non-generalizable patterns.

What would settle it

Measure self-correction accuracy on a held-out set of MATH problems after training with SCoRe. If the accuracy gain over the base model disappears or reverses, the central claim does not hold.
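
A minimal sketch of such a measurement, assuming each held-out problem yields a pass/fail verdict for the first attempt and for the corrected second attempt. The function name, the dictionary keys, and the reading of "self-correction gain" as second-attempt accuracy minus first-attempt accuracy are assumptions for illustration, not definitions quoted from the paper.

```python
from typing import Dict, Sequence


def self_correction_metrics(first_ok: Sequence[bool],
                            second_ok: Sequence[bool]) -> Dict[str, float]:
    """Summarize two-attempt results over a set of problems.

    first_ok[i] / second_ok[i]: whether attempt 1 / attempt 2 on problem i was correct.
    """
    if len(first_ok) != len(second_ok) or len(first_ok) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(first_ok)
    acc_t1 = sum(first_ok) / n
    acc_t2 = sum(second_ok) / n
    # Fraction of problems whose answer flipped from wrong to right, and right to wrong.
    wrong_to_right = sum((not a) and b for a, b in zip(first_ok, second_ok)) / n
    right_to_wrong = sum(a and (not b) for a, b in zip(first_ok, second_ok)) / n
    return {
        "acc_t1": acc_t1,
        "acc_t2": acc_t2,
        "delta": acc_t2 - acc_t1,
        "wrong_to_right": wrong_to_right,
        "right_to_wrong": right_to_wrong,
    }


print(self_correction_metrics([True, False, False, True],
                              [True, True, False, False]))
```

Reporting the two flip rates alongside the net gain helps separate a model that actually repairs errors from one whose second attempt merely reshuffles which problems it gets right.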

Original abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces SCoRe, a multi-turn online reinforcement learning method for training LLMs to self-correct using only self-generated data. It first shows that supervised fine-tuning (SFT) on offline model-generated correction traces often fails due to distribution mismatch between the data-collection policy and the model's responses or due to behavior collapse into ineffective correction modes. SCoRe mitigates this via online RL under the model's own trace distribution, combined with regularization consisting of an initial multi-turn RL phase to initialize a more stable policy followed by a reward bonus to encourage effective self-correction. Experiments on Gemini 1.0 Pro and 1.5 Flash report state-of-the-art self-correction, with absolute improvements of 15.6% and 9.1% on MATH and HumanEval respectively.

Significance. If the empirical gains prove robust under proper controls and generalize beyond the training distribution, the work would be significant for providing a scalable, supervision-light approach to instill self-correction in LLMs. The explicit treatment of distribution mismatch and collapse via online training plus targeted regularization (policy initialization + reward bonus) offers a concrete methodological advance over prior SFT-based attempts, with potential applicability to reasoning and code-generation tasks.

major comments (3)
  1. [Abstract] The central quantitative claim of 15.6% and 9.1% improvements is presented without any description of how self-correction success was measured (e.g., exact pass/fail criteria on MATH problems or HumanEval test cases), the precise baselines used for comparison, statistical significance tests, or variance across runs. This detail is load-bearing for the claim that SCoRe achieves effective test-time self-correction rather than benchmark-specific gains.
  2. [Experimental results section] The regularization components (initial multi-turn RL for policy initialization and reward bonus) are described as addressing behavior collapse, yet no ablation is reported that isolates the contribution of each component or tests whether the learned policy generalizes to novel error types or out-of-distribution problems. Without such controls, the reported gains could reflect fitting to high-reward traces seen during training rather than acquisition of a general correction capability.
  3. [Method section] The reward bonus is listed among the free parameters, but the text provides no sensitivity analysis, default value, or justification for its scale. This leaves open whether the reported improvements depend on careful tuning of this hyperparameter or hold across reasonable choices.
minor comments (1)
  1. [Abstract and Method] The abstract and method descriptions use the term 'self-correction behavior' without a precise operational definition (e.g., whether it requires the model to detect and fix its own errors in a single additional turn or over multiple turns). Adding a short formal definition would improve clarity.
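
The request for significance tests in major comment 1 could be met in several ways. One common option is a paired percentile bootstrap over test problems, sketched below under the assumption that per-problem pass/fail results are available for both the base and the trained model. The function name, the resample count, and the seed are illustrative choices; nothing here reflects an analysis the paper reports.

```python
import random
from typing import Sequence, Tuple


def bootstrap_gain_ci(base_ok: Sequence[bool],
                      tuned_ok: Sequence[bool],
                      n_resamples: int = 2000,
                      level: float = 0.95,
                      seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap interval for the accuracy gain (tuned minus base).

    Problems are resampled with replacement; each draw keeps the base and tuned
    results for the same problem together, so the comparison stays paired.
    """
    if len(base_ok) != len(tuned_ok) or len(base_ok) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(base_ok)
    gains = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(sum(int(tuned_ok[i]) - int(base_ok[i]) for i in idx) / n)
    gains.sort()
    lo_i = int((1.0 - level) / 2.0 * n_resamples)
    hi_i = min(n_resamples - 1, int((1.0 + level) / 2.0 * n_resamples) - 1)
    return gains[lo_i], gains[hi_i]
```

An interval that excludes zero would suggest the gain is unlikely to come from test-set sampling noise alone. It would not capture variance across training seeds, which needs repeated training runs.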

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We appreciate the opportunity to address the concerns regarding clarity in the abstract, the need for additional controls in the experiments, and hyperparameter details. We believe these points can be resolved through targeted revisions and clarifications without altering the core contributions of the work.

Point-by-point responses
  1. Referee: [Abstract] The central quantitative claim of 15.6% and 9.1% improvements is presented without any description of how self-correction success was measured (e.g., exact pass/fail criteria on MATH problems or HumanEval test cases), the precise baselines used for comparison, statistical significance tests, or variance across runs. This detail is load-bearing for the claim that SCoRe achieves effective test-time self-correction rather than benchmark-specific gains.

    Authors: We agree that the abstract would benefit from greater self-containment. In the revised manuscript we will expand the abstract to briefly specify the success metrics (exact answer match after correction attempts on MATH; standard pass@1 on HumanEval test cases), the primary baselines (base model and SFT variants), and that gains are measured under the model's own self-correction loop at test time. Detailed protocols appear in the experimental setup; we will add a short clause noting that results reflect consistent trends across the reported model scales. Formal statistical significance testing and multi-seed variance were not computed in the original experiments owing to the scale of the RL runs, but we will note this limitation explicitly. revision: yes

  2. Referee: [Experimental results section] The regularization components (initial multi-turn RL for policy initialization and reward bonus) are described as addressing behavior collapse, yet no ablation is reported that isolates the contribution of each component or tests whether the learned policy generalizes to novel error types or out-of-distribution problems. Without such controls, the reported gains could reflect fitting to high-reward traces seen during training rather than acquisition of a general correction capability.

    Authors: We acknowledge the value of isolating the regularization components. The current results already contrast full SCoRe against SFT baselines, which indirectly highlights the benefit of online training under the model's own distribution. In revision we will add explicit ablations that remove the policy-initialization phase and the reward bonus in turn, reporting the resulting performance drop. On generalization, evaluations use held-out test splits containing diverse error patterns; however, dedicated probes for entirely novel error types or strong distribution shifts were not performed. We will add a discussion of this scope limitation and list targeted generalization experiments as future work. revision: partial

  3. Referee: [Method section] The reward bonus is listed among the free parameters, but the text provides no sensitivity analysis, default value, or justification for its scale. This leaves open whether the reported improvements depend on careful tuning of this hyperparameter or hold across reasonable choices.

    Authors: We will revise the method section to state the default bonus value employed in the main experiments and the preliminary tuning procedure used to select it (balancing correction encouragement against the primary outcome reward). We will also include a short sensitivity table or plot showing performance for a modest range of bonus magnitudes around the chosen value, confirming that gains remain stable within that range. revision: yes

Circularity Check

0 steps flagged

No significant circularity in SCoRe's empirical RL derivation

full rationale

The paper's core contribution is an empirical multi-turn online RL procedure (SCoRe) that trains on the model's own self-generated correction traces, initialized via a preliminary RL phase and regularized with a reward bonus. Reported gains (15.6% on MATH, 9.1% on HumanEval) are measured outcomes on external held-out benchmarks after training, not quantities defined by or equivalent to the training objective itself. SFT failure modes are diagnosed via direct experimental observation of distribution mismatch and collapse, not by definitional reduction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central result; the method remains falsifiable against independent test distributions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard RL assumptions about reward signals guiding policy improvement and the premise that self-generated traces can serve as effective training data when properly regularized; no new entities are postulated.

free parameters (1)
  • reward bonus scale
    Used to amplify self-correction; specific value not reported in abstract.
axioms (1)
  • domain assumption Multi-turn interactions can be modeled as a Markov decision process for LLM correction
    Invoked implicitly when framing self-correction as multi-turn RL.
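
One way to make this assumption concrete for the two-attempt case is sketched below. The notation is introduced here for illustration and is not taken from the text above: $x$ is the problem, $y_1$ and $y_2$ are the first and second attempts, $p$ is a fixed correction instruction, $y^*$ is the reference answer, $\hat{r}$ is a correctness reward, and $\pi_\theta$ is the policy. Transitions are deterministic, since the next state is just the conversation so far.

```latex
\begin{aligned}
s_1 &= x, & a_1 &= y_1 \sim \pi_\theta(\cdot \mid x), \\
s_2 &= (x, y_1, p), & a_2 &= y_2 \sim \pi_\theta(\cdot \mid x, y_1, p), \\
J(\theta) &= \mathbb{E}_{x,\, y_1,\, y_2}\!\left[\hat{r}(y_1, y^*) + \hat{r}(y_2, y^*)\right].
\end{aligned}
```

In this reading, the reward bonus scale listed above would enter as an extra term added to the second-attempt reward.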

pith-pipeline@v0.9.0 · 5678 in / 1191 out tokens · 79062 ms · 2026-05-17T11:57:35.333634+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LawOfExistence defect_zero_iff_one echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt.

  • IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance unclear

    Relation between the paper passage and the cited Recognition theorem.

    This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction.

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

    Relation between the paper passage and the cited Recognition theorem.

    With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...

  2. Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    cs.CV 2025-03 unverdicted novelty 7.0

    Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...

  3. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

    cs.CL 2024-12 unverdicted novelty 7.0

    o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.

  4. Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

    cs.CL 2026-05 unverdicted novelty 6.0

    CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.

  5. A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

    cs.LG 2026-05 unverdicted novelty 6.0

    LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.

  6. Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

    cs.AI 2026-05 unverdicted novelty 6.0

    Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.

  7. PaT: Planning-after-Trial for Efficient Test-Time Code Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.

  8. Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

    cs.SE 2026-05 unverdicted novelty 6.0

    RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...

  9. When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

  10. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  11. REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

    cs.CV 2025-11 unverdicted novelty 6.0

    REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.

  12. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    cs.CV 2025-10 conditional novelty 6.0

    Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

  13. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    cs.CL 2024-12 unverdicted novelty 6.0

    HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

  14. CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 5.0

    CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.

  15. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  16. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

  17. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    cs.AI 2025-01 unverdicted novelty 3.0

    The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.


    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  60. [78]

    Advances in Neural Information Processing Systems , volume=

    Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

  61. [79]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

  62. [80]

    WebGPT: Browser-assisted question-answering with human feedback

    Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=

  63. [81]

    Proceedings of the AAAI Conference on Artificial Intelligence , pages=

    Learning to extract coherent summary via deep reinforcement learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , pages=

  64. [82]

    Improving alignment of dialogue agents via targeted human judgements

    Improving alignment of dialogue agents via targeted human judgements , author=. arXiv preprint arXiv:2209.14375 , year=

  65. [83]

    arXiv preprint arXiv:2307.16039 , year=

    Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback , author=. arXiv preprint arXiv:2307.16039 , year=

  66. [84]

    arXiv preprint arXiv:1907.12894 , year=

    Reward learning for efficient reinforcement learning in extractive document summarisation , author=. arXiv preprint arXiv:1907.12894 , year=

  67. [85]

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

    A Study of Reinforcement Learning for Neural Machine Translation , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

  68. [86]

    arXiv preprint arXiv:2304.01852 , year=

    Summary of chatgpt/gpt-4 research and perspective towards the future of large language models , author=. arXiv preprint arXiv:2304.01852 , year=

  69. [87]

    2023 , eprint=

    GPT-4 Technical Report , author=. 2023 , eprint=

  70. [88]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Google's neural machine translation system: Bridging the gap between human and machine translation , author=. arXiv preprint arXiv:1609.08144 , year=

  71. [89]

    Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

    Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning , author=. arXiv preprint arXiv:2404.05868 , year=

  72. [90]

    Direct nash optimization: Teaching language models to self-improve with general preferences.arXiv preprint arXiv:2404.03715,

    Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences , author=. arXiv preprint arXiv:2404.03715 , year=

  73. [91]

    2023 , note =

    An overview of Bard: an early experiment with generative AI , author=. 2023 , note =

  74. [92]

    Advances in Neural Information Processing Systems , volume=

    Direct preference optimization: Your language model is secretly a reward model , author=. Advances in Neural Information Processing Systems , volume=

  75. [93]

    2023 , note =

    OpenAI Pricing , author =. 2023 , note =

  76. [94]

    2023 , note =

    AI Platform Data Labeling Service pricing , author =. 2023 , note =

  77. [95]

    Findings of the Association for Computational Linguistics: EMNLP 2021 , pages=

    Want To Reduce Labeling Cost? GPT-3 Can Help , author=. Findings of the Association for Computational Linguistics: EMNLP 2021 , pages=

  78. [96]

    arXiv preprint arXiv:2303.15056 , year=

    Chatgpt outperforms crowd-workers for text-annotation tasks , author=. arXiv preprint arXiv:2303.15056 , year=

  79. [97]

    Is GPT -3 a Good Data Annotator?

    Ding, Bosheng and Qin, Chengwei and Liu, Linlin and Chia, Yew Ken and Li, Boyang and Joty, Shafiq and Bing, Lidong. Is GPT -3 a Good Data Annotator?. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.626

  80. [98]

    2023 , eprint=

    RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment , author=. 2023 , eprint=

Showing first 80 references.