RL post-trained models show stronger awareness of learned policies and better generalization to new tasks than SFT models, but display weaker alignment between internal reasoning traces and final outputs, especially under GRPO.
InProceedings of the 22nd ACM SIGKDD Interna- tional Conference on Knowledge Discovery and Data Mining, pages 1135–1144
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models
RL post-trained models show stronger awareness of learned policies and better generalization to new tasks than SFT models, but display weaker alignment between internal reasoning traces and final outputs, especially under GRPO.