K.3 Role-Conditioned Advantage Estimation

A critical challenge in two-player games is that the expected return differs by role
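One way to picture a role-conditioned advantage is to subtract a separate running baseline for each role, so that a role with a systematically higher expected return does not bias the policy gradient. The sketch below is only an illustration under that assumption: the class name `RoleBaseline`, the exponential-moving-average update, and the `decay` parameter are hypothetical and are not taken from the excerpt, which does not spell out the estimator.

```python
from collections import defaultdict


class RoleBaseline:
    """Per-role running baseline (hypothetical helper, not from the source)."""

    def __init__(self, decay=0.95):
        # decay controls how slowly the baseline tracks new returns (assumed value).
        self.decay = decay
        # One baseline per role, starting at 0.0.
        self.baseline = defaultdict(float)

    def advantage(self, role, ret):
        # Advantage is the return minus the current baseline for this role.
        adv = ret - self.baseline[role]
        # Update the baseline afterwards with an exponential moving average.
        self.baseline[role] = (
            self.decay * self.baseline[role] + (1.0 - self.decay) * ret
        )
        return adv


rb = RoleBaseline(decay=0.5)
first_adv = rb.advantage("first", 1.0)    # baseline for "first" starts at 0.0
second_adv = rb.advantage("second", -1.0) # tracked independently of "first"
```

Because each role keeps its own baseline, a player who tends to win in one role and lose in the other sees advantages centered near zero in both, which is the property the text attributes to conditioning on role.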

Policy Update: Update θ using policy gradient with role-conditioned advantages

The self-play mechanism ensures automatic curriculum learning: as the policy improves, its oppone


representative citing papers

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

cs.AI · 2026-04-20 · unverdicted · novelty 6.0 · ref 12

STRATAGEM uses a Reasoning Transferability Coefficient and Reasoning Evolution Reward in game self-play to promote domain-agnostic reasoning in language models, yielding gains on math, general reasoning, and code benchmarks.
