LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning

Sashenka Gamage; Tommaso Felice Banfi

arXiv: 2601.10775 · v2 · submitted 2026-01-15 · cs.CL · cs.GT · cs.LG

Pith reviewed 2026-05-16 13:52 UTC · model grok-4.3

classification: cs.CL · cs.GT · cs.LG

keywords: entropy-guided reasoning · adaptive chain-of-thought · in-context learning · game theory · Tic-Tac-Toe · LLM decision making · uncertainty estimation

The pith

Entropy-guided adaptive chain-of-thought reasoning raises LLM Tic-Tac-Toe outcomes from -11.6% to +9.5% average score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an LLM framework for game decisions that uses token entropy to decide how much context and how many reasoning paths to employ. When entropy is low the model retrieves few examples and follows a single short chain of thought; when entropy rises it pulls more examples and explores multiple paths in parallel. This adaptive rule produces a statistically significant lift in average game score against a weak opponent while keeping the total number of model calls modest. Analysis also shows a negative link between token entropy and the quality of the chosen move.

Core claim

Entropy-aware adaptive reasoning substantially improves decision quality, increasing the average game outcome from -11.6% with the baseline LLM to +9.5% with entropy-guided adaptive reasoning over 100 games (win = +1, tie = 0, loss = -1), while maintaining a relatively low number of LLM queries per game.
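
The scoring convention above (win = +1, tie = 0, loss = -1) can be made concrete with a minimal sketch. The function name and the tallies are hypothetical illustrations, not the paper's data or code.

```python
def average_outcome(wins: int, ties: int, losses: int) -> float:
    """Mean game score in percent, with win = +1, tie = 0, loss = -1."""
    games = wins + ties + losses
    if games == 0:
        raise ValueError("at least one game is required")
    # Ties contribute 0, so only the win/loss difference matters.
    return 100.0 * (wins - losses) / games


# Hypothetical tallies for illustration only (not taken from the paper).
print(average_outcome(wins=40, ties=30, losses=30))  # 10.0
```

Under this convention a negative average means the player lost more games than it won, and a positive average means the reverse.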

What carries the argument

The entropy-guided adaptive CoT mechanism that dynamically scales the number of retrieved in-context examples and the number of parallel reasoning paths according to token-level uncertainty.
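
A minimal sketch of how such an entropy-driven rule could look, assuming Shannon entropy over each next-token distribution, a per-move average of those entropies, and a small table of ascending thresholds. All names, threshold values, and example/path counts below are hypothetical; the paper's actual settings are not given on this page.

```python
import math
from bisect import bisect_right


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(token_dists):
    """Average token entropy over the tokens generated for one move."""
    if not token_dists:
        raise ValueError("need at least one token distribution")
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


# Hypothetical settings (assumptions, not the paper's values).
THRESHOLDS = [0.0, 0.5, 1.0]  # ascending entropy cut-offs, starting at 0
EXAMPLES = [1, 3, 5]          # retrieved in-context examples per entropy band
PATHS = [1, 2, 4]             # parallel reasoning paths per entropy band


def plan(h_step, n_legal_moves):
    """Map a step entropy to (number of examples, number of reasoning paths)."""
    band = max(bisect_right(THRESHOLDS, h_step) - 1, 0)
    # Never explore more paths than there are legal moves.
    n_paths = min(PATHS[band], n_legal_moves)
    return EXAMPLES[band], n_paths


print(plan(0.1, 9))  # low entropy: (1, 1)
print(plan(2.0, 3))  # high entropy, only 3 legal moves: (5, 3)
```

Capping the path count at the number of legal moves is one design choice for avoiding redundant queries late in a game, when few squares remain.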

If this is right

The improvement is statistically significant.
Higher token entropy correlates with less optimal moves.
Query count per game stays relatively low compared with fixed large-context baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same entropy signal might let the method transfer to other turn-based games if the correlation holds.
Testing against stronger opponents would show whether the adaptive expansion still delivers gains when baseline moves are already near-optimal.
Entropy could be used as a cheap uncertainty proxy for deciding when to invoke more expensive multi-step reasoning in other sequential decision tasks.

Load-bearing premise

Token-level entropy in the LLM output reliably signals move optimality and the chosen thresholds for context size and path count work beyond Tic-Tac-Toe against one sub-optimal opponent.

What would settle it

Applying the same adaptive rule in a new game such as chess against an optimal opponent and finding neither outcome improvement nor a negative entropy-optimality correlation would falsify the claim.

Original abstract

We propose a novel LLM-based framework for reasoning in discrete, game-theoretic tasks, illustrated with Tic-Tac-Toe. The method integrates in-context learning with entropy-guided chain-of-thought (CoT) reasoning and adaptive context retrieval. The model dynamically adjusts both the number of retrieved examples and reasoning paths according to token-level uncertainty: concise reasoning with minimal context is used when uncertainty is low, whereas higher uncertainty triggers expanded multi-path CoT exploration. Experimental evaluation against a sub-optimal algorithmic opponent shows that entropy-aware adaptive reasoning substantially improves decision quality, increasing the average game outcome from -11.6% with the baseline LLM to +9.5% with entropy-guided adaptive reasoning over 100 games (win = +1, tie = 0, loss = -1), while maintaining a relatively low number of LLM queries per game. Statistical validation confirms that the improvement is significant, and correlation analysis reveals a negative association between token-level entropy and move optimality. These findings demonstrate that uncertainty-guided adaptive reasoning effectively enhances LLM performance in sequential decision-making environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Entropy-guided adaptation lifts Tic-Tac-Toe scores from -11.6% to +9.5% but the optimality correlation lacks a perfect-play check.

The main point is that token entropy can be used to scale both the number of in-context examples and the depth of multi-path chain-of-thought when an LLM picks moves in Tic-Tac-Toe. Low entropy triggers short, minimal-context reasoning; high entropy pulls more examples and explores several paths. Against one fixed sub-optimal opponent the average outcome improves from -11.6% to +9.5% over 100 games, with the paper reporting statistical significance and a negative correlation between entropy and move quality. Query count stays relatively low because the model does not over-reason on easy positions.

This is a practical, low-overhead control knob rather than a new theoretical result. It combines existing in-context learning and CoT techniques in an adaptive way that fits sequential decision tasks. The empirical numbers and the correlation give the claim some grounding.

The soft spots are straightforward. The abstract gives no exact entropy formula, no description of how thresholds were set, and no full baseline prompting details, so the gains are hard to reproduce precisely from the text alone. More critically, the correlation is measured only on positions that arise against one weak opponent; without an external optimality oracle such as minimax values, it is possible the model simply assigns lower entropy to positions it has seen more often in training or examples rather than to objectively stronger moves. That distinction matters for any claim that entropy tracks decision quality.

The work is aimed at people building LLM agents for simple discrete games or planning problems who want an uncertainty-based way to allocate reasoning effort. It is narrow in scope but the idea is concrete enough that others could test the same mechanism on different games or opponents.

I would send it to peer review. The reported gains are positive and the adaptive rule is clear, so referees can ask for the missing implementation details and an independent optimality check without starting from zero.

Referee Report

2 major / 2 minor

Summary. The paper proposes an entropy-guided adaptive in-context learning and chain-of-thought framework for LLMs on discrete game tasks, using Tic-Tac-Toe as the running example. Token-level entropy dynamically controls the number of retrieved examples and reasoning paths (concise mode for low uncertainty, expanded multi-path exploration for high uncertainty). Over 100 games against one fixed sub-optimal opponent the method raises average outcome from -11.6 % (baseline LLM) to +9.5 % (adaptive), with statistical significance and a reported negative correlation between entropy and move quality.

Significance. If the entropy-optimality link and the reported gains are reproducible, the work supplies a lightweight, oracle-free mechanism for improving LLM sequential decision quality. The low query count and explicit uncertainty-driven adaptation are practical strengths that could transfer to other planning or game settings, provided the correlation generalizes beyond the narrow experimental regime.

major comments (2)

[Abstract and Methods] The exact token-level entropy formula, the numerical thresholds that trigger expanded context or additional reasoning paths, and the precise baseline prompting template are never stated, so the contribution of the entropy signal versus other prompting choices cannot be isolated.
[Correlation Analysis] The negative association between entropy and move optimality is measured only on positions generated against one fixed sub-optimal opponent; no comparison to an external optimality oracle (minimax value or perfect-play evaluation) is described, leaving open the possibility that lower entropy simply reflects training-data familiarity rather than objective move quality.
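
The external optimality oracle requested in the second comment is cheap to build for Tic-Tac-Toe. The following is a minimal sketch of an exhaustive minimax evaluator; the board encoding (a 9-character string of "X", "O", and ".") and all function names are assumptions made for illustration, not the paper's implementation.

```python
from functools import lru_cache

# All eight winning lines on a 3x3 board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Perfect-play value from X's view (+1 X wins, 0 draw, -1 O wins)."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0
    other = "O" if player == "X" else "X"
    values = [minimax(board[:i] + player + board[i + 1:], other)
              for i, cell in enumerate(board) if cell == "."]
    return max(values) if player == "X" else min(values)


def move_values(board, player):
    """Minimax value of every legal move for `player` on `board`."""
    other = "O" if player == "X" else "X"
    return {i: minimax(board[:i] + player + board[i + 1:], other)
            for i, cell in enumerate(board) if cell == "."}


def is_optimal(board, player, move):
    """True if `move` achieves the best minimax value available."""
    values = move_values(board, player)
    best = max(values.values()) if player == "X" else min(values.values())
    return values[move] == best


print(minimax("." * 9, "X"))  # 0: perfect play from the empty board is a draw
```

Labelling each LLM move with `is_optimal` would give a move-quality signal that does not depend on which opponent produced the position.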

minor comments (2)

[Experimental Results] Provide the full list of 100 game outcomes or at least summary statistics per position type so readers can verify the reported average and statistical test.
[Abstract] Clarify whether the adaptive rule is claimed to be parameter-free; the presence of tunable entropy thresholds and maximum path counts suggests otherwise.
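
With per-game outcomes in hand, a reader could rerun a significance check independently. The sketch below uses a two-sided permutation test on the difference in mean outcome, chosen here purely for illustration; this page does not state which test the paper uses, and the function name and inputs are hypothetical.

```python
import random


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for the difference in mean outcome.

    `a` and `b` are lists of per-game scores (+1, 0, -1) from two
    independent sets of games, e.g. baseline and adaptive runs.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical outcomes for illustration only (not the paper's data).
baseline = [1] * 30 + [0] * 30 + [-1] * 40
adaptive = [1] * 40 + [0] * 30 + [-1] * 30
p_value = permutation_test(baseline, adaptive)
```

The test treats games as exchangeable between the two conditions under the null hypothesis, which is an assumption a reader would need to check against the paper's protocol.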

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and the recommendation for major revision. We appreciate the feedback on clarity and experimental rigor. We address each major comment below, indicating the revisions we will implement.

Referee: [Abstract and Methods] The exact token-level entropy formula, the numerical thresholds that trigger expanded context or additional reasoning paths, and the precise baseline prompting template are never stated, so the contribution of the entropy signal versus other prompting choices cannot be isolated.

Authors: We agree that these implementation details are essential for reproducibility and for isolating the entropy signal. In the revised manuscript we will add the precise token-level entropy formula (Shannon entropy of the next-token distribution, −∑ p(i) log p(i), averaged over generated tokens), the numerical thresholds used to switch modes (e.g., entropy below 1.2 triggers concise single-path reasoning with minimal context; entropy above 2.5 triggers expanded multi-path exploration), and the full baseline prompting template in the Methods section and an appendix. These additions will allow readers to replicate the exact conditions and evaluate the incremental benefit of the entropy-guided adaptation. revision: yes
Referee: [Correlation Analysis] The negative association between entropy and move optimality is measured only on positions generated against one fixed sub-optimal opponent; no comparison to an external optimality oracle (minimax value or perfect-play evaluation) is described, leaving open the possibility that lower entropy simply reflects training-data familiarity rather than objective move quality.

Authors: We acknowledge the limitation of evaluating the correlation solely on positions arising from play against a single sub-optimal opponent. While this regime matches the primary experimental setting, an external oracle would strengthen the interpretation. In the revision we will include a supplementary analysis that computes minimax values for a held-out set of Tic-Tac-Toe positions (both in-distribution and out-of-distribution) and reports the correlation between token entropy and minimax move quality. This will help distinguish data familiarity from objective optimality. We note that the main performance gains (from -11.6 % to +9.5 % average outcome) remain the central empirical result and are obtained under the same opponent used for the correlation analysis. revision: partial
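
The supplementary analysis described in this response amounts to a rank correlation between per-move entropy and a move-quality score. A minimal pure-Python sketch of Spearman's coefficient, with tie handling via average ranks, is given below; the function names and the sample values are hypothetical and not drawn from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-move data: step entropy and 1/0 for optimal/non-optimal.
entropies = [0.2, 0.9, 1.5, 0.4, 2.1]
optimal = [1, 1, 0, 1, 0]
rho = spearman(entropies, optimal)
```

A negative `rho` on oracle-labelled moves would indicate that higher entropy goes with worse moves, which is the direction the paper reports.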

Circularity Check

0 steps flagged

No circularity: results are direct empirical comparisons

The paper proposes an entropy-guided adaptive CoT framework for Tic-Tac-Toe and reports experimental outcomes (baseline -11.6% to +9.5% average game score over 100 games). No derivation chain exists that reduces a claimed prediction or first-principles result to its own fitted inputs or self-citations by construction. The reported negative correlation between token entropy and move optimality is measured directly from the same experimental traces against one fixed opponent; it is not obtained by fitting a parameter and then relabeling it as a prediction, nor does any equation or uniqueness theorem collapse the result onto itself. The method's adaptive rules are heuristic thresholds chosen by the authors and validated by ablation, not derived from prior self-cited theorems that would make the improvement tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework adds adaptive control logic on top of standard LLM prompting; the main unstated premises are that entropy is a valid uncertainty proxy and that the chosen adaptation rules improve outcomes without introducing new bias.

free parameters (2)

entropy threshold for expanded reasoning
Determines switch between concise and multi-path modes; concrete value not stated in abstract
maximum number of retrieved examples and reasoning paths
Upper bounds on adaptive expansion; values not provided

axioms (1)

domain assumption: Token-level entropy in LLM output is a reliable indicator of move optimality
Invoked to justify the adaptive trigger and supported only by the reported correlation

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

token-level entropy ... \(H^{\text{token}}_{t,k} = -\sum_i p(i) \log p(i)\) ... \(H^{\text{step}}_t\) ... entropy thresholds \(0 = H_0 < H_1 < \cdots < H_m\) ... \(n_t = \min(n_j, |A(s_t)|)\)

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
Relation between the paper passage and the cited Recognition theorem: unclear

negative association between token-level entropy and move optimality ... Spearman ρ = −0.471

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.