hub Mixed citations

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman · 2025 · cs.CL · arXiv 2503.01307

Mixed citation behavior. Most common role is background (57%).

30 Pith papers citing it

abstract

Test-time inference has emerged as a powerful paradigm for enabling language models to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors (verification, backtracking, subgoal setting, and backward chaining) that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor: models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.

citation-role summary

background 6 · method 1

citation-polarity summary

background 4 · support 1 · unclear 1 · use method 1

representative citing papers

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

cs.CL · 2025-04-15 · conditional · novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency

cs.LG · 2026-01-29 · unverdicted · novelty 7.0

Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

cs.LG · 2025-07-02 · unverdicted · novelty 7.0

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

cs.LG · 2025-04-29 · accept · novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

Reasoning Can Be Restored by Correcting a Few Decision Tokens

cs.AI · 2026-05-16 · conditional · novelty 6.0

Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

Understanding the Mechanism of Altruism in Large Language Models

econ.GN · 2026-04-21 · unverdicted · novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

Rectifying LLM Thought from Lens of Optimization

cs.CL · 2025-12-01 · unverdicted · novelty 6.0

RePro defines a surrogate objective with intensity and stability scores to generate process-level rewards that enhance LLM reasoning efficiency and accuracy within RLVR pipelines.

SPHINX: A Synthetic Environment for Visual Perception and Reasoning

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

cs.CV · 2025-11-17 · unverdicted · novelty 6.0

REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

cs.CL · 2025-07-21 · unverdicted · novelty 6.0

Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

cs.CL · 2025-06-02 · conditional · novelty 6.0

High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

cs.CL · 2025-05-30 · conditional · novelty 6.0

Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

cs.LG · 2025-05-21 · unverdicted · novelty 6.0

Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.

Evaluating the False Trust Engendered by LLM Explanations

cs.HC · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

A Multi-Agent Framework for Automated Exploit Generation with Constraint-Guided Comprehension and Reflection

cs.SE · 2026-04-06 · unverdicted · novelty 5.0

Vulnsage, a multi-agent framework, generates 34.64% more exploits than prior tools and verified 146 zero-day vulnerabilities in real-world open-source libraries.

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

cs.CL · 2025-10-07 · unverdicted · novelty 5.0

EEPO uses sample-then-forget rollouts with adaptive unlearning to boost exploration in RLVR, delivering relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base over GRPO across five reasoning benchmarks.

Self-Aligned Reward: Towards Effective and Efficient Reasoners

cs.LG · 2025-09-05 · unverdicted · novelty 5.0

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

citing papers explorer

Showing 30 of 30 citing papers.

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

cs.CL · 2025-04-15 · conditional · none · ref 9 · internal anchor

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

cs.SE · 2026-05-12 · unverdicted · none · ref 61 · internal anchor

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency

cs.LG · 2026-01-29 · unverdicted · none · ref 8 · internal anchor

Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

cs.LG · 2025-07-02 · unverdicted · none · ref 10 · internal anchor

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

cs.LG · 2025-04-29 · accept · none · ref 6 · internal anchor

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · none · ref 79 · internal anchor

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

Reasoning Can Be Restored by Correcting a Few Decision Tokens

cs.AI · 2026-05-16 · conditional · none · ref 5 · internal anchor

Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

cs.CL · 2026-05-13 · unverdicted · none · ref 55 · internal anchor

STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

cs.LG · 2026-04-24 · unverdicted · none · ref 5 · internal anchor

LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

Understanding the Mechanism of Altruism in Large Language Models

econ.GN · 2026-04-21 · unverdicted · none · ref 154 · internal anchor

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

Rectifying LLM Thought from Lens of Optimization

cs.CL · 2025-12-01 · unverdicted · none · ref 4 · internal anchor

RePro defines a surrogate objective with intensity and stability scores to generate process-level rewards that enhance LLM reasoning efficiency and accuracy within RLVR pipelines.

SPHINX: A Synthetic Environment for Visual Perception and Reasoning

cs.CV · 2025-11-25 · unverdicted · none · ref 17 · internal anchor

SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

cs.CV · 2025-11-17 · unverdicted · none · ref 10 · internal anchor

REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

cs.CL · 2025-07-21 · unverdicted · none · ref 10 · internal anchor

Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

cs.CL · 2025-06-02 · conditional · none · ref 6 · internal anchor

High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

cs.CL · 2025-05-30 · conditional · none · ref 56 · internal anchor

Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · none · ref 15 · internal anchor

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

cs.LG · 2025-05-21 · unverdicted · none · ref 22 · internal anchor

Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

cs.CL · 2026-05-21 · unverdicted · none · ref 73 · internal anchor

LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.

Evaluating the False Trust Engendered by LLM Explanations

cs.HC · 2026-05-11 · unverdicted · none · ref 51 · 2 links · internal anchor

LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

cs.AI · 2026-05-08 · unverdicted · none · ref 9 · internal anchor

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

A Multi-Agent Framework for Automated Exploit Generation with Constraint-Guided Comprehension and Reflection

cs.SE · 2026-04-06 · unverdicted · none · ref 17 · internal anchor

Vulnsage, a multi-agent framework, generates 34.64% more exploits than prior tools and verified 146 zero-day vulnerabilities in real-world open-source libraries.

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

cs.CL · 2025-10-07 · unverdicted · none · ref 6 · internal anchor

EEPO uses sample-then-forget rollouts with adaptive unlearning to boost exploration in RLVR, delivering relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base over GRPO across five reasoning benchmarks.

Self-Aligned Reward: Towards Effective and Efficient Reasoners

cs.LG · 2025-09-05 · unverdicted · none · ref 12 · internal anchor

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

Proximal Supervised Fine-Tuning

cs.LG · 2025-08-25 · unverdicted · none · ref 7 · internal anchor

PSFT modifies supervised fine-tuning by incorporating trust-region ideas from RL to constrain policy changes, yielding better out-of-domain generalization in math and human-value tasks without entropy collapse.

Emerging Properties in Unified Multimodal Pretraining

cs.CV · 2025-05-20 · unverdicted · none · ref 21 · internal anchor

BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · none · ref 197 · internal anchor

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

A Survey of Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-09-10 · accept · none · ref 153 · internal anchor

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control

cs.CL · 2025-11-25 · unreviewed · ref 9 · internal anchor

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

cs.LG · 2025-09-26 · unreviewed · ref 8 · internal anchor
