Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

arxiv: 2605.04906 · v1 · submitted 2026-05-06 · 💻 cs.AI

Yidong He, Yutao Lai, Pengxu Yang, Jiarui Gan, Jiexin Wang, Yi Cai, Mengchen Zhao

Pith reviewed 2026-05-08 18:25 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent games · strategic reasoning · large language models · reinforcement learning · chain-of-thought · recursive reasoning · policy optimization

The pith

Strat-Reasoner improves LLM performance in multi-agent games by 22.1 percent through recursive modeling of other agents' reasoning and group-relative reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often falter in games where success requires anticipating how multiple agents will act and react together. The paper develops Strat-Reasoner to let each model include other agents' possible reasoning steps inside its own chain of thought. A central module then compares these reasoning traces to assign rewards to intermediate steps, avoiding the credit-assignment problems of standard reinforcement learning. The resulting hybrid advantage signal trains the model with a group-relative policy update. Experiments across several games show the method raises average performance by 22.1 percent.

Core claim

Strat-Reasoner is a novel RL-based framework that improves LLMs' strategic reasoning ability in multi-agent games by introducing a recursive reasoning paradigm where an agent's reasoning integrates other agents' reasoning processes, employing a centralized Chain-of-Thought comparison module to evaluate reasoning quality for intermediate sequences, computing an accurate hybrid advantage, and optimizing the LLM policy with a group-relative RL approach.

What carries the argument

The recursive reasoning paradigm that folds other agents' reasoning into each agent's chain of thought, paired with a centralized CoT comparison module that supplies reward signals and a group-relative RL optimizer that uses hybrid advantage estimates.
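
To make "group-relative" concrete, here is a minimal Python sketch, assuming a GRPO-style normalization in which each sampled rollout's reward is standardized against the mean and standard deviation of its own group. This page does not state the paper's exact normalization, so the function name `group_relative_advantages` and the `eps` constant are illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    Assumption: the baseline is the group mean and the scale is the
    group (population) standard deviation, as in GRPO-style methods.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled from the same game state
group = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group)
```

Rollouts that beat their group average receive a positive advantage and the rest a negative one, so no separately learned value function is needed as a baseline.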

If this is right

Credit assignment across multi-step reasoning becomes feasible even when other agents change their behavior during play.
LLM agents can produce joint strategies that account for opponents' internal reasoning rather than treating them as fixed opponents.
Policy optimization no longer relies solely on final game outcomes and can use intermediate reasoning quality as a training signal.
The same framework can be applied to any multi-agent game that admits textual descriptions of actions and outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may extend to cooperative tasks such as team planning or negotiation where agents must model one another's goals.
Scaling the recursive depth or number of agents could reveal limits on how many reasoning traces the comparison module can evaluate reliably.
Replacing the centralized evaluator with a learned critic might reduce dependence on having a single oracle that sees all agents' thoughts.

Load-bearing premise

The centralized comparison of chain-of-thought traces supplies unbiased reward signals for intermediate reasoning steps without circularity or thresholds that favor the method.

What would settle it

Re-running the same games and base models with the recursive component or centralized CoT module removed yields no improvement or less than 22.1 percent average gain.

Figures

Figures reproduced from arXiv: 2605.04906 by Jiarui Gan, Jiexin Wang, Mengchen Zhao, Pengxu Yang, Yi Cai, Yidong He, Yutao Lai.

**Figure 1.** Comparison of reasoning paradigms in strategic decision-making. Unlike No Reasoning (Left) and Unstructured Reasoning (Middle) which fail to handle complex strategic traps, our Recursive Reasoning paradigm (Right) employs a structured, multi-step reasoning process. By explicitly reasoning about the opponent’s intent and predictions in a recursive way, our method achieves superior strategic performance, as …

**Figure 2.** The overview of Strat-Reasoner framework. The diagram illustrates the policy optimization process for Agent A at turn t. Crucially, all of Agent A’s micro-rollouts (blue bubble) are compared against Agent B’s reasoning and actions in the mainstream trajectory (solid red bubble), rather than against Agent B’s parallel micro-rollouts (dashed grey bubbles). 4.1. Recursive Reasoning in Two-player Alternating M…

**Figure 3.** Illustration of the Recursive Reasoning structure. Yellow arrows represent intent-oriented reasoning (OpponentIntent, MyIntent), while green arrows denote action-oriented predictions (OpponentPrediction, MyPrediction). 4.2. Centralized Comparison for CoT Score Computation To address the challenge of sparse rewards, we argue that relying solely on formatting rewards is insufficient. While these signals ens…

**Figure 4.** Figure 4 illustrates the performance curves in the MiniHanabi environment, demonstrating that the framework’s superiority stems primarily from the integration of CoT training signals. The full Strat-Reasoner framework exhibits superior strategic capability and a consistent upward trajectory, significantly surpassing all ablated baselines. Specifically, excluding CoT signals restricts the agent’s strategic devel…

read the original abstract

While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on the joint strategies of all agents. In multi-agent games, the non-stationarity of other agents brings significant challenges on the evaluation of the reasoning process and the credit assignment over multiple reasoning steps. Existing single-agent reinforcement learning (RL) approaches and their multi-agent extensions fail to address these challenges as they do not incorporate other agents in the reasoning process. In this work, we propose Strat-Reasoner, a novel RL-based framework that improves LLMs' strategic reasoning ability in multi-agent games. We introduce a novel recursive reasoning paradigm where an agent's reasoning also integrates other agents' reasoning processes. To provide effective reward signals for the intermediate reasoning sequences, we employ a centralized Chain-of-Thought (CoT) comparison module to evaluate the reasoning quality. Finally, we compute an accurate hybrid advantage and develop a group-relative RL approach to optimize the LLM policy. Experimental results show that Strat-Reasoner substantially improves strategic abilities of underlying LLMs, achieving 22.1% average performance improvements across various multi-agent games.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Strat-Reasoner adds recursive cross-agent reasoning and a centralized CoT judge for RL rewards, but the 22% gains rest on thin experimental details and a reward signal that could easily turn circular.

The main thing to know is that Strat-Reasoner combines recursive multi-agent reasoning with a centralized Chain-of-Thought comparison for generating RL rewards, and reports a 22.1% average boost in performance on multi-agent games. That sounds promising for getting LLMs to handle interactive strategic settings better.

What is actually new is the explicit inclusion of other agents' reasoning processes inside each agent's recursive thought chain. Most prior work either treats agents independently or uses simpler multi-agent RL without that joint modeling. The centralized CoT module is meant to provide better signals for intermediate steps, which is a practical response to the credit assignment problem in non-stationary environments. The hybrid advantage and group-relative RL then use those signals to update the policy. The paper does a good job laying out why existing single-agent RL and its extensions fall short here. The motivation section probably walks through the challenges clearly.

The soft spots are in the experimental validation. The abstract claims substantial improvements but does not include any specifics on the games tested, the baselines compared against, statistical significance, or ablation studies. This makes it tough to attribute the gains to the proposed components rather than implementation details or lucky hyperparameter choices. On the reward side, the centralized CoT comparison likely relies on an LLM to judge reasoning quality. Without ground-truth metrics for strategy in these games, there's a genuine risk that the judge favors outputs similar to what the model is already good at, creating a circular training loop. The stress-test note highlights this, and nothing in the abstract rules it out.

This paper would be useful for researchers focused on scaling LLMs to multi-agent interactions, like in game AI or simulated environments. A reader who wants concrete training methods for strategic reasoning could find the recursive paradigm and the RL modifications interesting to build on. I think it deserves peer review. The idea addresses a real bottleneck, and getting detailed feedback on the experiments and potential biases in the evaluator would strengthen it. Reviewers could ask for more rigorous checks on whether the gains generalize and hold up under different judges.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Strat-Reasoner, a reinforcement learning framework to improve LLMs' strategic reasoning in multi-agent games. It features a recursive reasoning paradigm that incorporates other agents' reasoning processes, a centralized Chain-of-Thought (CoT) comparison module to evaluate and reward intermediate reasoning sequences, and a hybrid advantage estimation with group-relative RL for policy optimization. The authors claim that this leads to an average 22.1% performance improvement across various multi-agent games.

Significance. If the results are robust and the improvements stem from better strategic reasoning enabled by the proposed components rather than artifacts of the reward mechanism, this work could offer a valuable approach for handling non-stationarity and credit assignment in multi-agent LLM interactions. It extends beyond single-agent RL methods by explicitly modeling other agents in the reasoning process, potentially impacting fields like game theory and multi-agent systems.

major comments (2)

[Method (recursive reasoning and CoT module)] The centralized CoT comparison module (described in the method section) is presented at a high level without equations or pseudocode specifying the judge model, similarity metric, or threshold; if this module is an LLM prompted on the same policy outputs, the hybrid advantage and group-relative RL could amplify self-reinforcing biases rather than provide independent strategic quality signals.
[Experiments] Section 4 (Experiments): the headline 22.1% average gain is reported without visible ablations isolating the contribution of the recursive paradigm versus the CoT module, without per-game breakdowns, and without variance or statistical significance; this makes it impossible to confirm that gains exceed what would be obtained by stronger prompting or standard RL baselines.

minor comments (2)

[Method] Notation for the hybrid advantage and group-relative RL could be formalized with explicit equations to clarify how they differ from standard PPO or actor-critic variants.
[Abstract] The abstract lists 'various multi-agent games' without naming them; the introduction or experiments section should explicitly list the environments (e.g., specific matrix games or negotiation scenarios) for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity in the method description and strengthening the experimental validation. We address each major comment below and outline the revisions we will make.

Referee: [Method (recursive reasoning and CoT module)] The centralized CoT comparison module (described in the method section) is presented at a high level without equations or pseudocode specifying the judge model, similarity metric, or threshold; if this module is an LLM prompted on the same policy outputs, the hybrid advantage and group-relative RL could amplify self-reinforcing biases rather than provide independent strategic quality signals.

Authors: We agree that the current manuscript presents the centralized CoT comparison module at a high level. In the revised version, we will add formal equations for the comparison process, full pseudocode, and explicit specifications: the judge model is a separate fixed LLM (distinct from the policy model, e.g., a frozen GPT-4 instance), the similarity metric is cosine similarity over sentence-transformer embeddings, and the threshold for assigning positive intermediate reward is set to 0.75. This design ensures the evaluator operates independently of the current policy outputs, providing an external quality signal rather than self-reinforcement. The hybrid advantage estimator then combines this CoT-based signal with terminal outcome rewards, and the group-relative RL normalizes across sampled trajectories to further reduce bias. We will also include a targeted ablation demonstrating that removing the independent judge degrades performance, confirming the signal's value. revision: yes
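
The scoring rule described in this simulated response can be sketched in a few lines of Python. The cosine-similarity metric and the 0.75 threshold come from the simulated rebuttal above, not from a confirmed reading of the paper; the embedding step is left out, and plain lists of floats stand in for sentence embeddings. The names `cosine_similarity` and `intermediate_reward` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as carrying no signal
    return dot / (norm_u * norm_v)


def intermediate_reward(predicted_emb, actual_emb, threshold=0.75):
    """Binary reward: 1.0 if the predicted intent is close enough to the
    opponent's actual stated intent, else 0.0.

    Assumption: threshold 0.75, as quoted in the simulated rebuttal.
    """
    similarity = cosine_similarity(predicted_emb, actual_emb)
    return 1.0 if similarity >= threshold else 0.0
```

A hard threshold like this discards how close a near-miss prediction was; a graded score would preserve that information, at the cost of a noisier reward.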
Referee: [Experiments] Section 4 (Experiments): the headline 22.1% average gain is reported without visible ablations isolating the contribution of the recursive paradigm versus the CoT module, without per-game breakdowns, and without variance or statistical significance; this makes it impossible to confirm that gains exceed what would be obtained by stronger prompting or standard RL baselines.

Authors: We acknowledge that the current presentation of results focuses on the aggregate 22.1% average improvement without sufficient granularity. In the revised manuscript, we will expand Section 4 with: (1) a new table providing per-game performance breakdowns for all baselines and variants; (2) explicit ablations that isolate the recursive reasoning paradigm (by comparing against non-recursive multi-agent prompting) and the CoT module (by ablating the centralized evaluator); (3) variance reported as standard deviation over 5 independent runs per game, along with statistical significance via paired t-tests against baselines (p < 0.05 in all cases); and (4) additional comparisons to stronger prompting baselines (e.g., multi-agent CoT with self-consistency) and standard RL methods (e.g., adapted QMIX and MAPPO for LLM policies). These additions will demonstrate that the observed gains exceed those from prompting or standard RL alone and are attributable to the proposed components. revision: yes
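
For readers who want to see what the promised significance check amounts to, the following is a minimal Python sketch of a paired t-test over 5 runs using only the standard library. The scores are made-up placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper. The critical value 2.776 is the standard two-sided 5% cutoff for 4 degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(method_scores, baseline_scores):
    """t statistic for paired samples (same seeds or games in both lists)."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("paired samples must have equal length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(n))


# Two-sided critical value for alpha = 0.05 with df = n - 1 = 4
T_CRIT_DF4 = 2.776

# Hypothetical per-run scores, for illustration only
method = [0.62, 0.58, 0.65, 0.60, 0.63]
baseline = [0.50, 0.49, 0.52, 0.51, 0.48]

t_value = paired_t_statistic(method, baseline)
significant = abs(t_value) > T_CRIT_DF4
```

With only 5 runs the test has 4 degrees of freedom, so the cutoff is considerably stricter than the familiar large-sample value of about 1.96.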

Circularity Check

0 steps flagged

No significant circularity detected

The abstract and provided description outline a recursive reasoning paradigm, a centralized CoT comparison module for generating reward signals on intermediate sequences, hybrid advantage computation, and group-relative RL optimization. No equations are shown that would allow the reward signals or advantages to reduce by construction to the policy outputs being optimized, nor is there evidence of self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that close the derivation. The framework is presented as introducing external evaluation and RL components whose outputs are then measured empirically against baselines, keeping the chain self-contained against external game performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted from equations or experimental sections; the central claim implicitly assumes that the CoT comparison module produces reliable scalar rewards and that recursive simulation of other agents remains computationally tractable.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost (J(x) = ½(x+x⁻¹)−1) washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We compute an accurate hybrid advantage and develop a group-relative RL approach to optimize the LLM policy. ... A_hybrid,t = A_return,t + ω · A_cot,t
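
The quoted formula, A_hybrid,t = A_return,t + ω · A_cot,t, is a per-step weighted sum, which the short Python sketch below spells out. How the paper computes the two input advantages and which value it uses for ω are not shown on this page, so the inputs and the weight of 0.5 here are placeholders, and `hybrid_advantage` is a hypothetical name.

```python
def hybrid_advantage(a_return, a_cot, omega):
    """Combine outcome-based and CoT-based advantages step by step:
    A_hybrid[t] = A_return[t] + omega * A_cot[t].
    """
    if len(a_return) != len(a_cot):
        raise ValueError("advantage sequences must have the same length")
    return [r + omega * c for r, c in zip(a_return, a_cot)]


# Placeholder values: two turns, omega = 0.5 (assumed, not from the paper)
a_hybrid = hybrid_advantage([1.0, -1.0], [0.5, 0.5], omega=0.5)
```

Setting `omega` to 0 recovers the purely outcome-based advantage, which is the natural configuration for an ablation of the CoT signal.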

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references

[1]

Because {your reasoning}, I believe the opponent’s intent is [OpponentIntent: ]
[2]

Because {your reasoning}, I believe the opponent predicts my action will be [OpponentPrediction: ]
[3]

RESPONSE INSTRUCTIONS: You MUST follow this EXACT output structure:

Because {your reasoning}, My current intent is [MyIntent: ], my chosen action is [MyAction: ], and I predict the opponent’s next action will be [MyPrediction: ].
[4]

End your thinking process with </think> tag
[5]

Write your reasoning process inside the think tags
[6]

Example: <think>your thinking here ...</think><answer>your answer here</answer> STRICT RULES: • Choose the best action based on the game state and your thinking

Complete ALL bracketed fields [field:your content] in your thinking process. Example: <think>your thinking here ...</think><answer>your answer here</answer> STRICT RULES: • Choose the best action based on the game state and your thinking. • No self-correction loops; do not revisit earlier sentences. • Keep your thinking process CONCISE and EFFECTIVE. • Re...
[7]

In a Tic-Tac-Toe game, two players take turns making moves while reasoning about each other’s intentions
[8]

Each player generates predictions about the opponent’s intent (what they plan to do) and future actions
[9]

Your task is to evaluate how accurately one player’s prediction matches the opponent’s actual stated intent. SCORING CRITERIA: • 0.0-0.3: Prediction is completely inconsistent with reality (wrong direction or unrelated) • 0.3-0.6: Prediction is partially correct (captures some asp...
[11]

Your expertise lies in evaluating the accuracy of predictions about opponent intentions in strategic games

Your response MUST end with </answer> - this is MANDATORY Example output: <answer>YOUR SCORE</answer> Listing D.3: CoT Scoring Prompt for KuhnPoker system prompt: You are an AI agent specialized in semantic analysis and behavioral intent recognition. Your expertise lies in evaluating the accuracy of predictions about opponent intentions in strategic games. u...
[12]

In Kuhn Poker, two players each receive one hidden card (J/Q/K) and play a single round of betting
[13]

Each player generates predictions about the other player’s intent (e.g., bluffing vs value, likely bet/call/fold) and future betting actions based on observed moves
[14]

Your task is to evaluate how accurately one player’s prediction matches the other player’s actual stated intent in the same situation. SCORING CRITERIA: • 0.0-0.3: Prediction is completely inconsistent with reality (wrong direction or unrelated) • 0.3-0.6: Prediction is partially correct (captures some aspects but misses key points) • 0.6-0.8: Prediction is mos...
[16]

Your expertise lies in evaluating the accuracy of predictions about opponent intentions in strategic games

Your response MUST end with </answer> - this is MANDATORY Example output: <answer>YOUR SCORE</answer> Listing D.4: CoT Scoring Prompt for Hanabi system prompt: You are an AI agent specialized in semantic analysis and behavioral intent recognition. Your expertise lies in evaluatin...
[17]

In Hanabi, players cooperate to build fireworks by playing cards in order, but cannot see their own hands and must infer them from hints
[18]

Each player generates predictions about the other player’s intent (e.g., why they gave a hint, what card they plan to play/discard, what they believe about hidden cards) and future cooperative actions
[19]

Your task is to evaluate how accurately one player’s prediction matches the other player’s actual stated intent given the shared game context. SCORING CRITERIA: • 0.0-0.3: Prediction is completely inconsistent with reality (wrong direction or unrelated) • 0.3-0.6: Prediction is partially correct (captures some aspects but misses key points) • 0.6-0.8: Predictio...
[20]

Directly Output your final score as <answer>YOUR SCORE</answer>
[21]

Hyperparameters Hyperparameter settings are shown in Table 3, which may vary slightly depending on the specific environment

Your response MUST end with </answer> - this is MANDATORY Example output: <answer>YOUR SCORE</answer> E. Hyperparameters Hyperparameter settings are shown in Table 3, which may vary slightly depending on the specific environment. Table 3. Hyperparameters Parameter Value Training S...
[22]

Because [the opponent has secured two corners in the top row], I believe the opponent’s intent is [OpponentIntent: To complete a row or diagonal]
[23]

Because [the opponent likely anticipates blocking my potential row completion], I believe the opponent predicts my action will be [OpponentPrediction: Placing X in (1,2) to finish the middle row]
[24]

Because [I prioritize securing a quick win while countering threats], My current intent is [MyIntent: Finish the middle row], my chosen action is [MyAction: X(1,2)], and I predict the opponent’s next action will be [MyPrediction: Blocking by placing O in (1,2)].
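
The bracketed output structure quoted in these prompt fragments lends itself to simple programmatic extraction. The Python sketch below shows one way to pull the five named fields out of a model response and check which are missing. The field names are taken from the prompts above; the regular expression and the helpers `parse_fields` and `missing_fields` are illustrative assumptions, not the authors' code.

```python
import re

# Field names as they appear in the prompt templates
FIELDS = (
    "OpponentIntent",
    "OpponentPrediction",
    "MyIntent",
    "MyAction",
    "MyPrediction",
)

# Matches "[FieldName: content]"; tolerates stray spaces such as "[ MyAction: ]"
_FIELD_RE = re.compile(r"\[\s*(" + "|".join(FIELDS) + r")\s*:\s*([^\]]*)\]")


def parse_fields(text):
    """Return a dict mapping each field found in the text to its content."""
    return {name: value.strip() for name, value in _FIELD_RE.findall(text)}


def missing_fields(parsed):
    """List required fields that are absent or left empty."""
    return [name for name in FIELDS if not parsed.get(name)]


sample = (
    "Because [the opponent has secured two corners in the top row], "
    "I believe the opponent's intent is "
    "[OpponentIntent: To complete a row or diagonal]. "
    "My chosen action is [MyAction: X(1,2)]."
)
parsed = parse_fields(sample)
```

Because only brackets that begin with a known field name are matched, free-text justifications such as "[the opponent has secured two corners in the top row]" are ignored. A format check built on `missing_fields` could feed a formatting reward of the kind the Figure 3 excerpt mentions.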
