Nanbeige4-3b technical report: Exploring the frontier of small language models. arXiv preprint arXiv:2512.06266.
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Extreme Region Policy Distillation
ERPD decouples aggressive off-policy optimization on fixed trajectories from trust-region distillation to achieve comparable or better LLM performance with substantially smaller KL divergence.