Training Language Models to Self-Correct via Reinforcement Learning
Pith reviewed 2026-05-17 11:57 UTC · model grok-4.3
The pith
Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SCoRe performs multi-turn online reinforcement learning directly on traces the model generates itself, starting with a policy-initialization phase of RL followed by a reward bonus that amplifies effective self-correction, thereby avoiding the distribution mismatch and behavior collapse that limit supervised fine-tuning on offline correction data.
What carries the argument
SCoRe, the multi-turn online RL procedure that trains under the model's own distribution of correction traces and uses phased initialization plus reward regularization to produce generalizable self-correction behavior.
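The reward bonus can be illustrated with a small sketch. This page does not state the bonus formula, so the form below, which adds a term proportional to the improvement from the first attempt to the second, is an assumption made for illustration. The names `shaped_second_turn_reward` and `alpha` are hypothetical.

```python
def shaped_second_turn_reward(r_first: float, r_second: float, alpha: float = 1.0) -> float:
    """Second-attempt reward plus an ASSUMED progress bonus.

    r_first, r_second: correctness rewards (e.g. 0.0 or 1.0) for each attempt.
    alpha: hypothetical bonus scale (the "reward bonus scale" free parameter).
    """
    bonus = alpha * (r_second - r_first)  # positive only when the answer improves
    return r_second + bonus


# Compare the four possible two-turn outcomes under binary rewards.
for r1, r2 in [(0.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]:
    print((r1, r2), shaped_second_turn_reward(r1, r2, alpha=2.0))
```

Under this assumed shaping, fixing a wrong first answer earns more than repeating a correct one, and breaking a correct answer is penalized, which is the kind of pressure against collapse that the core claim describes.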
If this is right
- SCoRe raises self-correction rates by 15.6% on MATH with Gemini 1.0 Pro and by 9.1% on HumanEval with Gemini 1.5 Flash.
- The method achieves state-of-the-art self-correction results using only self-generated data and no external models or supervision.
- Variants of supervised fine-tuning on model-generated traces are shown to be insufficient for instilling reliable self-correction.
- Regularization via an initial multi-turn RL phase and a reward bonus steers learning away from collapse into ineffective modes.
Where Pith is reading between the lines
- The same self-generated-data RL pattern could be tested on other iterative tasks such as multi-step reasoning or code debugging.
- If the regularization prevents collapse here, similar bonuses might stabilize other multi-turn RL applications in language models.
- Applying SCoRe to open-source base models would test whether the reported gains require proprietary model families.
Load-bearing premise
Training under the model's own distribution of self-generated correction traces plus the described regularization will produce correction behavior that works on unseen test problems instead of high-reward but non-generalizable patterns.
What would settle it
Measure self-correction accuracy on a held-out set of MATH problems after training with SCoRe; if the accuracy gain disappears or reverses relative to the base model, the central claim does not hold.
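A check of this kind needs per-problem outcomes for both attempts. The sketch below shows one way to summarize them, assuming a two-attempt protocol with binary correctness; the `Outcome` type, the function name, and the metric names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    first_correct: bool   # was the first attempt correct?
    second_correct: bool  # was the revised attempt correct?


def self_correction_metrics(outcomes: list[Outcome]) -> dict[str, float]:
    """Summarize two-attempt results on a held-out problem set."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    acc_t1 = sum(o.first_correct for o in outcomes) / n
    acc_t2 = sum(o.second_correct for o in outcomes) / n
    # Fraction of problems that went wrong -> right, and right -> wrong.
    fixed = sum((not o.first_correct) and o.second_correct for o in outcomes) / n
    broken = sum(o.first_correct and (not o.second_correct) for o in outcomes) / n
    return {
        "acc_t1": acc_t1,
        "acc_t2": acc_t2,
        "delta": acc_t2 - acc_t1,  # net gain from the second attempt
        "fixed": fixed,
        "broken": broken,
    }


example = [
    Outcome(False, True),
    Outcome(True, True),
    Outcome(True, False),
    Outcome(False, False),
]
print(self_correction_metrics(example))
```

Reporting `fixed` and `broken` separately matters because `delta` equals their difference: a model can show zero net gain while still changing many answers in both directions.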
original abstract
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SCoRe, a multi-turn online reinforcement learning method for training LLMs to self-correct using only self-generated data. It first shows that supervised fine-tuning (SFT) on offline model-generated correction traces often fails due to distribution mismatch between the data-collection policy and the model's responses or due to behavior collapse into ineffective correction modes. SCoRe mitigates this via online RL under the model's own trace distribution, combined with regularization consisting of an initial multi-turn RL phase to initialize a more stable policy followed by a reward bonus to encourage effective self-correction. Experiments on Gemini 1.0 Pro and 1.5 Flash report state-of-the-art self-correction, with absolute improvements of 15.6% and 9.1% on MATH and HumanEval respectively.
Significance. If the empirical gains prove robust under proper controls and generalize beyond the training distribution, the work would be significant for providing a scalable, supervision-light approach to instill self-correction in LLMs. The explicit treatment of distribution mismatch and collapse via online training plus targeted regularization (policy initialization + reward bonus) offers a concrete methodological advance over prior SFT-based attempts, with potential applicability to reasoning and code-generation tasks.
major comments (3)
- [Abstract] The central quantitative claim of 15.6% and 9.1% improvements is presented without any description of how self-correction success was measured (e.g., exact pass/fail criteria on MATH problems or HumanEval test cases), the precise baselines used for comparison, statistical significance tests, or variance across runs. This detail is load-bearing for the claim that SCoRe achieves effective test-time self-correction rather than benchmark-specific gains.
- [Experimental results section] The regularization components (initial multi-turn RL for policy initialization and reward bonus) are described as addressing behavior collapse, yet no ablation is reported that isolates the contribution of each component or tests whether the learned policy generalizes to novel error types or out-of-distribution problems. Without such controls, the reported gains could reflect fitting to high-reward traces seen during training rather than acquisition of a general correction capability.
- [Method section] The reward bonus is listed among the free parameters, but the text provides no sensitivity analysis, default value, or justification for its scale. This leaves open whether the reported improvements depend on careful tuning of this hyperparameter or hold across reasonable choices.
minor comments (1)
- [Abstract and Method] The abstract and method descriptions use the term 'self-correction behavior' without a precise operational definition (e.g., whether it requires the model to detect and fix its own errors in a single additional turn or over multiple turns). Adding a short formal definition would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We appreciate the opportunity to address the concerns regarding clarity in the abstract, the need for additional controls in the experiments, and hyperparameter details. We believe these points can be resolved through targeted revisions and clarifications without altering the core contributions of the work.
point-by-point responses
-
Referee: [Abstract] The central quantitative claim of 15.6% and 9.1% improvements is presented without any description of how self-correction success was measured (e.g., exact pass/fail criteria on MATH problems or HumanEval test cases), the precise baselines used for comparison, statistical significance tests, or variance across runs. This detail is load-bearing for the claim that SCoRe achieves effective test-time self-correction rather than benchmark-specific gains.
Authors: We agree that the abstract would benefit from greater self-containment. In the revised manuscript we will expand the abstract to briefly specify the success metrics (exact answer match after correction attempts on MATH; standard pass@1 on HumanEval test cases), the primary baselines (base model and SFT variants), and that gains are measured under the model's own self-correction loop at test time. Detailed protocols appear in the experimental setup; we will add a short clause noting that results reflect consistent trends across the reported model scales. Formal statistical significance testing and multi-seed variance were not computed in the original experiments owing to the scale of the RL runs, but we will note this limitation explicitly. revision: yes
-
Referee: [Experimental results section] The regularization components (initial multi-turn RL for policy initialization and reward bonus) are described as addressing behavior collapse, yet no ablation is reported that isolates the contribution of each component or tests whether the learned policy generalizes to novel error types or out-of-distribution problems. Without such controls, the reported gains could reflect fitting to high-reward traces seen during training rather than acquisition of a general correction capability.
Authors: We acknowledge the value of isolating the regularization components. The current results already contrast full SCoRe against SFT baselines, which indirectly highlights the benefit of online training under the model's own distribution. In revision we will add explicit ablations that remove the policy-initialization phase and the reward bonus in turn, reporting the resulting performance drop. On generalization, evaluations use held-out test splits containing diverse error patterns; however, dedicated probes for entirely novel error types or strong distribution shifts were not performed. We will add a discussion of this scope limitation and list targeted generalization experiments as future work. revision: partial
-
Referee: [Method section] The reward bonus is listed among the free parameters, but the text provides no sensitivity analysis, default value, or justification for its scale. This leaves open whether the reported improvements depend on careful tuning of this hyperparameter or hold across reasonable choices.
Authors: We will revise the method section to state the default bonus value employed in the main experiments and the preliminary tuning procedure used to select it (balancing correction encouragement against the primary outcome reward). We will also include a short sensitivity table or plot showing performance for a modest range of bonus magnitudes around the chosen value, confirming that gains remain stable within that range. revision: yes
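The first response notes that significance testing and multi-seed variance were not computed. One inexpensive partial substitute is a paired bootstrap over evaluation problems, sketched below. This is an illustrative assumption, not a procedure from the paper or the rebuttal, and the function name and parameters are hypothetical.

```python
import random


def bootstrap_delta_ci(first, second, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(second) - mean(first).

    first, second: equal-length lists of 0/1 correctness, paired by problem.
    Returns (point_estimate, lower, upper).
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(first)
    diffs = [s - f for f, s in zip(first, second)]
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = max(lo_idx, int((1 + level) / 2 * n_resamples) - 1)
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


first_attempt = [0, 1, 1, 0, 0, 1, 0, 1]
second_attempt = [1, 1, 1, 0, 1, 1, 0, 0]
print(bootstrap_delta_ci(first_attempt, second_attempt))
```

An interval of this kind reflects uncertainty from the finite evaluation set only. It does not capture variation across RL training seeds, which is the limitation the authors say they will note.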
Circularity Check
No significant circularity in SCoRe's empirical RL derivation
full rationale
The paper's core contribution is an empirical multi-turn online RL procedure (SCoRe) that trains on the model's own self-generated correction traces, initialized via a preliminary RL phase and regularized with a reward bonus. Reported gains (15.6% on MATH, 9.1% on HumanEval) are measured outcomes on external held-out benchmarks after training, not quantities defined by or equivalent to the training objective itself. SFT failure modes are diagnosed via direct experimental observation of distribution mismatch and collapse, not by definitional reduction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central result; the method remains falsifiable against independent test distributions.
Axiom & Free-Parameter Ledger
free parameters (1)
- reward bonus scale
axioms (1)
- domain assumption Multi-turn interactions can be modeled as a Markov decision process for LLM correction
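The MDP assumption can be made concrete with a toy rollout. In this sketch the state is the problem plus all earlier attempts, the action is the next response, and the transition appends that response to the history. The `State` type, the `rollout` function, and the toy policy and reward are hypothetical illustrations rather than the paper's formulation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class State:
    problem: str
    attempts: tuple[str, ...] = ()  # earlier responses, visible to the policy


def rollout(
    problem: str,
    policy: Callable[[State], str],
    reward: Callable[[str, str], float],
    turns: int = 2,
) -> tuple[State, list[float]]:
    """Run one multi-turn episode and collect the per-turn rewards."""
    state = State(problem)
    rewards = []
    for _ in range(turns):
        action = policy(state)                    # the model's response this turn
        rewards.append(reward(problem, action))   # e.g. answer correctness
        # Deterministic transition: the response joins the history.
        state = State(problem, state.attempts + (action,))
    return state, rewards


# Toy policy: answers wrongly at first, then revises once it sees a prior attempt.
def toy_policy(state: State) -> str:
    return "4" if state.attempts else "5"


def toy_reward(problem: str, answer: str) -> float:
    return 1.0 if (problem, answer) == ("2+2", "4") else 0.0


final_state, rewards = rollout("2+2", toy_policy, toy_reward)
print(final_state.attempts, rewards)
```

Because the state carries the full attempt history, the process is Markov by construction, which is what allows standard RL machinery to be applied across turns.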
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.LawOfExistence · defect_zero_iff_one · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt.
-
IndisputableMonolith.Foundation.LedgerForcing · conservation_from_balance · unclear
UNCLEAR: relation between the paper passage and the cited Recognition theorem.
This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction.
-
IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
UNCLEAR: relation between the paper passage and the cited Recognition theorem.
With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
-
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
-
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
-
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
-
PaT: Planning-after-Trial for Efficient Test-Time Code Generation
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
-
Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
-
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
-
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
-
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.
-
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
-
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
-
CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation
CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.