Measuring massive multitask language understanding
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs
Logit averaging inside GRPO yields higher or comparable benchmark accuracy to KL-regularized GRPO without using KL terms or a critic.