UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.
Vla-opd: Bridging offline sft and online rl for vision-language-action models via on-policy distillation
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.
Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.
citing papers explorer
-
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.
-
Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.
-
Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.