Evaluating Counterfactual Strategic Reasoning in Large Language Models

Angeliki Dimitriou; Dimitrios Georgousis; Giorgos Filandrianos; Giorgos Stamou; Maria Lymperaiou

arxiv: 2603.19167 · v2 · pith:BKLO5P3Snew · submitted 2026-03-19 · 💻 cs.CL

Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou

Pith reviewed 2026-05-25 06:32 UTC · model grok-4.3

classification 💻 cs.CL

keywords large language models · strategic reasoning · counterfactual evaluation · prisoners dilemma · rock paper scissors · game theory · memorized patterns

The pith

Large language models rely on memorized patterns rather than genuine strategic reasoning when game payoffs and action labels are altered.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates LLMs playing repeated rounds of Prisoner's Dilemma and Rock-Paper-Scissors, then tests the same models on new versions where the payoff numbers are changed and the action names are replaced. These alterations are designed to remove familiar patterns that might have appeared in training data. The results show that models have difficulty adjusting their choices to the new incentives, fail to generalize across the changed structures, and perform worse at strategic play overall. A reader would care because the findings question whether LLMs can handle strategic decisions outside of well-known examples.

Core claim

LLM strategic performance in repeated game-theoretic settings reflects reliance on memorized patterns rather than genuine reasoning, as shown by limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.

What carries the argument

Counterfactual variants of Prisoner's Dilemma and Rock-Paper-Scissors created by altering payoff structures and action labels to break familiar symmetries and dominance relations.
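As a rough illustration of what such a variant might look like, the sketch below relabels the Prisoner's Dilemma actions and changes the payoffs so that the familiar dominance relation no longer holds. The payoff numbers, the labels `alpha` and `beta`, and the helper names `relabel` and `dominant_action` are assumptions made for illustration; the paper's actual variants are not described in the text available here.

```python
from typing import Dict, Optional, Tuple

# (row action, column action) -> (row payoff, column payoff)
Payoffs = Dict[Tuple[str, str], Tuple[int, int]]

# Textbook Prisoner's Dilemma payoffs (a common choice, not taken from the paper).
DEFAULT_PD: Payoffs = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}

# Hypothetical altered payoffs: "cooperate" now strictly dominates.
ALTERED_PAYOFFS: Payoffs = {
    ("cooperate", "cooperate"): (5, 5),
    ("cooperate", "defect"): (3, 0),
    ("defect", "cooperate"): (0, 3),
    ("defect", "defect"): (1, 1),
}


def relabel(game: Payoffs, mapping: Dict[str, str]) -> Payoffs:
    """Rename actions while leaving the payoffs untouched."""
    return {(mapping[a], mapping[b]): pay for (a, b), pay in game.items()}


def dominant_action(game: Payoffs) -> Optional[str]:
    """Return the row player's strictly dominant action, or None if there is none."""
    own_actions = sorted({own for own, _ in game})
    opp_actions = sorted({opp for _, opp in game})
    for candidate in own_actions:
        rivals = [a for a in own_actions if a != candidate]
        if all(
            game[(candidate, opp)][0] > game[(rival, opp)][0]
            for rival in rivals
            for opp in opp_actions
        ):
            return candidate
    return None


# Hypothetical neutral labels that carry no cooperative or adversarial connotation.
COUNTERFACTUAL_PD = relabel(ALTERED_PAYOFFS, {"cooperate": "alpha", "defect": "beta"})
```

In this sketch the default game has `defect` as the dominant action, while the counterfactual game has `alpha` (the relabeled former `cooperate`) as dominant, so a player that simply reproduces the textbook answer would choose badly.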

If this is right

LLMs display limited sensitivity to changes in game incentives.
They exhibit poor structural generalization to modified game setups.
Strategic reasoning is impaired in environments that deviate from standard forms.
Overall performance indicates pattern memorization over adaptive reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar counterfactual tests could be applied to other decision tasks such as negotiation or resource allocation.
The results suggest that increasing variety in training examples of game-like interactions might improve adaptability.
If the pattern holds, LLMs would have restricted reliability in novel competitive or cooperative settings.

Load-bearing premise

The introduced counterfactual variants of PD and RPS successfully isolate genuine strategic reasoning from memorized patterns by altering payoff structures and action labels.

What would settle it

If LLMs performed and adapted as well in the counterfactual game variants as they do in the standard versions, the claim of reliance on memorized patterns would be falsified.
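One way to make this test concrete is to compare how often a model picks the payoff-maximizing action in the default and counterfactual conditions. The sketch below uses a two-proportion z-test for that comparison. The choice of test, the function name `two_proportion_z`, and the counts in the usage line are assumptions for illustration only; they are not the paper's method or data.

```python
import math
from typing import Tuple


def two_proportion_z(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> Tuple[float, float]:
    """Two-sided z-test for a difference between two proportions.

    Returns (z statistic, p-value) using the pooled standard error.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts, NOT results from the paper:
# optimal-action choices out of 100 rounds in each condition.
z_stat, p_val = two_proportion_z(90, 100, 50, 100)
```

A large, statistically significant gap between conditions would be consistent with the memorization reading, whereas a negligible gap would count against it.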

Original abstract

We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.
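To see why altered payoffs break the familiar symmetry of Rock-Paper-Scissors, consider a zero-sum variant in which the three possible wins are worth different amounts. The sketch below is a hypothetical construction (the parameters `a`, `b`, `c` and all function names are assumptions, not the paper's setup). It shows that uniform random play stops being safe once one win is worth more, and gives the mixed strategy that restores indifference.

```python
from fractions import Fraction
from typing import List

ACTIONS = ["rock", "paper", "scissors"]


def rps_matrix(a: int, b: int, c: int) -> List[List[int]]:
    """Row player's payoffs in a zero-sum RPS variant.

    a: payoff when rock beats scissors
    b: payoff when scissors beats paper
    c: payoff when paper beats rock
    Rows and columns follow the order in ACTIONS.
    """
    return [
        [0, -c, a],
        [c, 0, -b],
        [-a, b, 0],
    ]


def expected_payoffs(matrix: List[List[int]], opponent_mix: List[Fraction]) -> List[Fraction]:
    """Expected payoff of each pure action against a mixed opponent strategy."""
    return [sum(Fraction(v) * q for v, q in zip(row, opponent_mix)) for row in matrix]


def equilibrium_mix(a: int, b: int, c: int) -> List[Fraction]:
    """Mixed strategy that makes the opponent indifferent between all actions.

    Solving matrix * q = 0 gives q proportional to (b, a, c).
    """
    total = Fraction(a + b + c)
    return [b / total, a / total, c / total]


UNIFORM = [Fraction(1, 3)] * 3

# Default game: every win is worth 1, so uniform play leaves nothing to exploit.
default_vs_uniform = expected_payoffs(rps_matrix(1, 1, 1), UNIFORM)

# Hypothetical counterfactual: rock beating scissors is worth 3.
altered = rps_matrix(3, 1, 1)
altered_vs_uniform = expected_payoffs(altered, UNIFORM)
altered_vs_equilibrium = expected_payoffs(altered, equilibrium_mix(3, 1, 1))
```

Under these assumed payoffs, a uniform opponent can be exploited by always playing rock, and the indifference mix shifts weight toward paper. A player that keeps mixing uniformly out of habit therefore leaves value on the table.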

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript evaluates LLMs on repeated Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS) games. It introduces counterfactual variants that alter payoff structures and action labels to disrupt familiar dominance relations and symmetries. The central claim is that observed strategic performance reflects reliance on memorized patterns rather than genuine reasoning, as shown by limitations in incentive sensitivity, structural generalization, and strategic reasoning within the counterfactual environments. A multi-metric evaluation framework is used to compare default and counterfactual instantiations.

Significance. If the empirical results and controls hold, the work would offer a targeted test distinguishing memorization from counterfactual strategic reasoning in LLMs, with relevance to AI safety and decision-making applications. The design directly targets the distinction between pattern matching and incentive-driven play. However, the provided text contains only the abstract and no methods, data, quantitative results, or error analysis, preventing assessment of whether the counterfactuals isolate the intended construct or whether the reported limitations are robust.

major comments (1)

[Abstract / Evaluation Framework] The manuscript provides no methods section, model specifications, prompt templates, number of trials, or quantitative metrics (e.g., cooperation rates, win percentages, or statistical tests). Without these, it is impossible to determine whether the counterfactual variants produce the claimed drops in performance or whether the evaluation framework supports the central claim that limitations reflect memorization rather than reasoning.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We acknowledge that the version under consideration contained only the abstract and will expand the manuscript to include the requested details.

Point-by-point responses

Referee: [Abstract / Evaluation Framework] The manuscript provides no methods section, model specifications, prompt templates, number of trials, or quantitative metrics (e.g., cooperation rates, win percentages, or statistical tests). Without these, it is impossible to determine whether the counterfactual variants produce the claimed drops in performance or whether the evaluation framework supports the central claim that limitations reflect memorization rather than reasoning.

Authors: We agree that the submitted version contained only the abstract and therefore lacked the methods, model specifications, prompt templates, trial counts, and quantitative results needed for evaluation. The revised manuscript will add a dedicated Methods section that specifies the LLMs tested, exact prompt templates, number of independent trials per condition, cooperation and win-rate metrics, and the statistical tests used to compare default versus counterfactual conditions. This will allow direct assessment of whether performance drops are robust and whether the multi-metric framework isolates memorization from reasoning.

Revision: yes
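For readers unfamiliar with the kind of metrics named in this exchange, the following is a minimal sketch of a repeated-game loop that records a cooperation rate. It uses simple scripted policies as stand-ins; an LLM-backed policy would replace them in a real evaluation. All names here (`play_repeated`, `cooperation_rate`, the two policies) are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Tuple

# Each history entry is (own move, opponent move) from one player's point of view.
History = List[Tuple[str, str]]
Policy = Callable[[History], str]


def tit_for_tat(history: History) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return "cooperate" if not history else history[-1][1]


def always_defect(history: History) -> str:
    """Defect in every round regardless of history."""
    return "defect"


def play_repeated(policy_a: Policy, policy_b: Policy, rounds: int) -> History:
    """Play a repeated two-action game and return player A's history."""
    history_a: History = []
    history_b: History = []
    for _ in range(rounds):
        move_a = policy_a(history_a)
        move_b = policy_b(history_b)
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return history_a


def cooperation_rate(history: History) -> float:
    """Fraction of rounds in which the player chose 'cooperate'."""
    return sum(1 for own, _ in history if own == "cooperate") / len(history)


# Stand-in run: tit-for-tat against an unconditional defector for 10 rounds.
rate = cooperation_rate(play_repeated(tit_for_tat, always_defect, 10))
```

Running the same loop on default and counterfactual payoff tables, over several independent trials per condition, would yield the per-condition rates that the referee asks to see.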

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is an empirical evaluation study that introduces counterfactual variants of PD and RPS to test LLM behavior, with no mathematical derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations. The central claim about memorization vs. reasoning is supported by direct comparison of performance metrics across conditions, which is externally falsifiable and does not reduce to any self-referential construction. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are described.
