LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.
T−1X t=0 γt rt # ,(21) wherer t is the reward at steptandγ∈[0,1]is a discount factor. Under the actor–critic framework, the policy gradient theorem states that ∇θJ(θ) =E πθ
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.