On-policy distillation has an extrapolation cliff at a closed-form threshold lambda*(p, b, c), set by the teacher's modal probability, the warm-start mass, and the clip strength; past this threshold, training shifts from format-preserving to format-collapsing.
Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
method 1
citation-polarity summary
years
2026 2
verdicts
UNVERDICTED 2
roles
method 1
polarities
use method 1
representative citing papers
Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models
DiScO enhances LLM mathematical reasoning by training for awareness of diverse thinking schemata, using RL to promote diversity, and applying it at inference, outperforming standard GRPO.