LLM reasoning generalization under weak RL supervision is governed by prolonged reward saturation dynamics and pre-RL reasoning faithfulness, enabled by SFT on explicit reasoning traces plus domain pre-training.
Title resolution pending
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
- When Can LLMs Learn to Reason with Weak Supervision?