ElasticMem enables LLM agents to learn adaptive latent memory retrieval and elastic budget allocation, improving QA accuracy by 24-26% and ALFWorld success by 27-66% over baselines with lower token cost.
hub Canonical reference
Training Language Models to Self-Correct via Reinforcement Learning
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
hub tools
citation-role summary
citation-polarity summary
roles
background 7representative citing papers
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reasoning tasks.
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
PROPEL amortizes solver evaluation with a trained activation probe to optimize task generators toward a target solve rate, raising the share of learnable tasks from ~10% to ~20% in coding and SWE experiments.
MemoPilot trains memory updates for LLM agents via multi-turn GRPO on RPS and poker, achieving top Elo scores and outperforming baselines including DeepSeek-V3.2.
CATPO introduces an informativeness score F(T) and critique-guided healing for failed trees to improve efficiency and performance in tree-based RLVR, reaching 37.5% macro accuracy on math benchmarks.
DyCon dynamically controls reasoning depth in LRMs by modeling evolving difficulty from step-level embeddings, reducing redundant steps across multiple benchmarks.
Weak models used as critics supplying non-misleading revision directions, distilled on-policy via OPCD, improve frozen and trained strong models on reasoning and alignment benchmarks.
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
ExpGraph builds a graph of summarized agent experiences and uses graph diffusion plus an RL-trained retrieval copilot to improve frozen LLM executors on QA, math, code, and agentic tasks without parameter updates.
CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.
LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
MapAgent augments vectorized lane mapping with a bounded verification-driven agent loop that diagnoses specification violations and applies minimal edits, reaching over 95% automation in Baidu Maps production across 360+ cities.
citing papers explorer
No citing papers match the current filters.