Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.
Canonical reference
Black-box on-policy distillation of large language models. arXiv preprint
Canonical reference. 80% of citing Pith papers cite this work as background.
years: 2026 (16)
Representative citing papers
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
OPD for LLMs suffers from length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent both and improves math reasoning performance by 7.2% on average.
Adversarial training combines verifiable RL rewards with a discriminator proxy for human outputs to jointly optimize accuracy and non-verifiable qualities like naturalness in language models.
PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
Local teachability collapse occurs in later trajectory segments during strong-to-weak OPD; a margin-based release rule using top-K teacher advantage and BIC change-point detection on sentence segments outperforms full-trajectory supervision on five in-domain benchmarks and preserves out-of-domain pe
MOPD improves on-policy distillation by using peer successes and failures from multiple rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
PADD distills from dense teachers to MoE students via neuron clustering, expert warmup, online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing, yielding gains on math reasoning benchmarks.
f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.
Applying sparse rewards to capable teachers for exploration, then densely distilling into students, outperforms applying sparse rewards directly (e.g., with GRPO) to the deployment model.
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
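
Several of the summaries above describe token-level alignment between a teacher and a student distribution; the PTD-PO entry names a Top-K Jensen-Shannon divergence. The following is a minimal sketch of one plausible reading of that term, not the method of any listed paper: both distributions are restricted to the teacher's k most probable tokens, renormalized, and compared with the standard Jensen-Shannon divergence. The function name `top_k_js_divergence`, its arguments, and the renormalization choice are all assumptions made for illustration.

```python
import math


def top_k_js_divergence(teacher_probs, student_probs, k):
    """Jensen-Shannon divergence on the teacher's top-k tokens.

    Hypothetical helper for illustration only. Assumes both inputs are
    probability vectors over the same vocabulary and that the student
    assigns non-zero mass to the selected tokens (true for softmax outputs).
    """
    # Indices of the k most probable tokens under the teacher.
    top = sorted(range(len(teacher_probs)),
                 key=lambda i: teacher_probs[i], reverse=True)[:k]

    # Renormalize both distributions on that shared support (an assumption).
    t_mass = sum(teacher_probs[i] for i in top)
    s_mass = sum(student_probs[i] for i in top)
    p = [teacher_probs[i] / t_mass for i in top]
    q = [student_probs[i] / s_mass for i in top]

    # Mixture distribution used by the Jensen-Shannon divergence.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # KL(a || b) in nats; terms with a_i == 0 contribute zero.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    teacher = [0.7, 0.2, 0.1]
    student = [0.1, 0.2, 0.7]
    print(top_k_js_divergence(teacher, student, k=2))
```

Because the Jensen-Shannon divergence is bounded above by ln 2 (in nats), a loss of this form stays finite even when teacher and student disagree sharply, which is one common motivation for preferring it over an unbounded KL term.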