Distilling system 2 into system 1

Yu, P · 2024 · arXiv 2407.06023

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Purified OPSD: On-Policy Self-Distillation Without Losing How to Think

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

Purified OPSD subtracts a reference-only teacher's signal from standard OPSD supervision and applies PMI to create a cleaner distillation target, yielding gains on long-CoT models while preserving epistemic behavior.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Training Large Language Models to Reason in a Continuous Latent Space

cs.CL · 2024-12-09 · unverdicted · novelty 7.0

Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency trade-offs.

E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation

cs.RO · 2026-06-25 · unverdicted · novelty 6.0

E-TTS introduces a plug-and-play test-time scaling method for embodied tasks that unifies reasoning-action sampling with history buffers and closed-loop refinement to improve performance on manipulation benchmarks.

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

Dynamic Rollout Editing reduces overthinking in RL-trained LLMs by editing post-answer continuations in successful rollouts and preferring the edited versions within GRPO groups.

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.

Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

cs.LG · 2026-04-12 · unverdicted · novelty 6.0

Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.

Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models

cs.CL · 2026-01-16 · unverdicted · novelty 6.0

NCoTS treats chain-of-thought reasoning as a search problem and uses a dual-factor heuristic to find paths that are over 3.5% more accurate and 22% shorter on benchmarks.

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

cs.LG · 2026-05-06 · unverdicted · novelty 5.0

Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

Efficient Test-Time Scaling via Temporal Reasoning Aggregation

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.

Self-Aligned Reward: Towards Effective and Efficient Reasoners

cs.LG · 2025-09-05 · unverdicted · novelty 5.0

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

cs.CL · 2025-03-20 · accept · novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

Knowledge Distillation Must Account for What It Loses

cs.LG · 2026-04-28 · unverdicted · novelty 4.0 · 2 refs

Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

citing papers explorer

Showing 11 of 11 citing papers after filters.

Purified OPSD: On-Policy Self-Distillation Without Losing How to Think cs.AI · 2026-07-02 · unverdicted · none · ref 21
Purified OPSD subtracts a reference-only teacher's signal from standard OPSD supervision and applies PMI to create a cleaner distillation target, yielding gains on long-CoT models while preserving epistemic behavior.
Training Large Language Models to Reason in a Continuous Latent Space cs.CL · 2024-12-09 · unverdicted · none · ref 34
Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency trade-offs.
E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation cs.RO · 2026-06-25 · unverdicted · none · ref 44
E-TTS introduces a plug-and-play test-time scaling method for embodied tasks that unifies reasoning-action sampling with history buffers and closed-loop refinement to improve performance on manipulation benchmarks.
Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models cs.CL · 2026-06-16 · unverdicted · none · ref 10
Dynamic Rollout Editing reduces overthinking in RL-trained LLMs by editing post-answer continuations in successful rollouts and preferring the edited versions within GRPO groups.
ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning cs.AI · 2026-06-02 · unverdicted · none · ref 35
ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents cs.LG · 2026-04-12 · unverdicted · none · ref 42
Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models cs.CL · 2026-01-16 · unverdicted · none · ref 14
NCoTS treats chain-of-thought reasoning as a search problem and uses a dual-factor heuristic to find paths that are over 3.5% more accurate and 22% shorter on benchmarks.
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning cs.LG · 2026-05-06 · unverdicted · none · ref 53
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
Efficient Test-Time Scaling via Temporal Reasoning Aggregation cs.AI · 2026-04-19 · unverdicted · none · ref 20
TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.
Self-Aligned Reward: Towards Effective and Efficient Reasoners cs.LG · 2025-09-05 · unverdicted · none · ref 51
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
Knowledge Distillation Must Account for What It Loses cs.LG · 2026-04-28 · unverdicted · none · ref 36 · 2 links
Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

Distilling system 2 into system 1

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer