hub

Learning by distilling context

Charlie Snell, Dan Klein, Ruiqi Zhong · 2022 · arXiv 2209.15189

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

cs.LG · 2026-05-12 · conditional · novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.

CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

Near-Future Policy Optimization

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

Context Memorization for Efficient Long Context Generation

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Attention-state memory externalizes long prefixes into a lightweight lookup table of precomputed attention states, yielding higher accuracy than standard in-context learning at fixed memory budgets and lower latency than full attention.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

VSPO: Vector-Steered Policy Optimization for Behavioral Control

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

cs.CL · 2026-05-08 · conditional · novelty 6.0

Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.

TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing token use.

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

cs.CR · 2024-04-19 · unverdicted · novelty 6.0

Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.

Large Language Models Can Self-Improve

cs.CL · 2022-10-20 · unverdicted · novelty 6.0

A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

cs.CL · 2026-04-09 · accept · novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

Tuning Qwen2.5-VL to Improve Its Web Interaction Skills

cs.HC · 2026-02-20 · unverdicted · novelty 5.0

Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.

It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs

cs.LG · 2026-05-18 · unverdicted · novelty 4.0

SELFCI uses complementary self-distillation with two reverse KL divergences to align LLMs to contextual integrity while preserving utility, outperforming RL baselines like GRPO in agentic settings.

citing papers explorer

Showing 16 of 16 citing papers.

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation cs.LG · 2026-05-12 · conditional · none · ref 33
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why cs.LG · 2026-05-11 · unverdicted · none · ref 20
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization cs.LG · 2026-05-09 · unverdicted · none · ref 28
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
Near-Future Policy Optimization cs.LG · 2026-04-22 · unverdicted · none · ref 22
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models cs.AI · 2024-06-14 · conditional · none · ref 297
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning cs.LG · 2026-05-21 · unverdicted · none · ref 5
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
Context Memorization for Efficient Long Context Generation cs.CL · 2026-05-18 · unverdicted · none · ref 36
Attention-state memory externalizes long prefixes into a lightweight lookup table of precomputed attention states, yielding higher accuracy than standard in-context learning at fixed memory budgets and lower latency than full attention.
Self-Supervised On-Policy Distillation for Reasoning Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 1
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
VSPO: Vector-Steered Policy Optimization for Behavioral Control cs.LG · 2026-05-15 · unverdicted · none · ref 22
VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts cs.CL · 2026-05-08 · conditional · none · ref 15
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation cs.CL · 2026-04-09 · unverdicted · none · ref 58
TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing token use.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions cs.CR · 2024-04-19 · unverdicted · none · ref 10
Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.
Large Language Models Can Self-Improve cs.CL · 2022-10-20 · unverdicted · none · ref 12
A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning cs.CL · 2026-04-09 · accept · none · ref 58
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
Tuning Qwen2.5-VL to Improve Its Web Interaction Skills cs.HC · 2026-02-20 · unverdicted · none · ref 12
Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.
It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs cs.LG · 2026-05-18 · unverdicted · none · ref 39
SELFCI uses complementary self-distillation with two reverse KL divergences to align LLMs to contextual integrity while preserving utility, outperforming RL baselines like GRPO in agentic settings.

Learning by distilling context

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer