STRATAGEM uses a Reasoning Transferability Coefficient and Reasoning Evolution Reward in game self-play to promote domain-agnostic reasoning in language models, yielding gains on math, general reasoning, and code benchmarks.
K.3 Role-Conditioned Advantage Estimation A critical challenge in two-player games is that the expected return differs by role
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
STRATAGEM uses a Reasoning Transferability Coefficient and Reasoning Evolution Reward in game self-play to promote domain-agnostic reasoning in language models, yielding gains on math, general reasoning, and code benchmarks.