Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Nadav Cohen; Shahar Mendel; Yotam Alexander; Yuval Ran-Milo

arxiv: 2601.15158 · v4 · pith:MXQRVVEXnew · submitted 2026-01-21 · 💻 cs.LG · cs.AI

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Yuval Ran-Milo , Yotam Alexander , Shahar Mendel , Nadav Cohen This is my paper

classification 💻 cs.LG cs.AI

keywords gradientpolicyreasoningtransformerschain-of-thoughtdataexamplesgraph

0 comments

read the original abstract

Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
cs.LG 2026-04 unverdicted novelty 6.0

Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...