ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing designability and an 8x speedup over RL baselines.
Towards on-policy sft: Distribution discriminant theory and its applications in llm training
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3roles
other 1polarities
unclear 1representative citing papers
FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.
DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.
citing papers explorer
-
ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design
ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing designability and an 8x speedup over RL baselines.
-
Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.
-
Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation
DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.