arxiv: 2604.20043 · v1 · submitted 2026-04-21 · 💻 cs.CL · cs.AI

Recognition: unknown

TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs

Ziyi Wang , Chen Zhang , Wenjun Peng , Qi Wu , Xinyu Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multi-agent LLMsexplainabilitybelief dynamicsstrategic gamesexplanation faithfulnesstri-view frameworkagent reasoning

0 comments

The pith

TriEx aligns self-reasoning, opponent beliefs, and environment audits to make multi-agent LLM explanations checkable in strategic games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TriEx, a framework that equips multi-agent LLM decision sequences with three aligned artifacts: first-person self-reasoning tied directly to each action, second-person belief states about other agents that update over time, and third-person oracle audits drawn from environment reference signals. This alignment converts free-form explanations into comparable, evidence-anchored objects that can be inspected across time and perspectives. When applied to imperfect-information strategic games as a testbed, the setup supports scalable checks on explanation faithfulness, belief dynamics, and evaluator consistency. The work shows these checks expose systematic differences between what the agents state, what they internally maintain, and what they actually execute.

Core claim

TriEx instruments sequential decisions with structured first-person self-reasoning bound to actions, explicit second-person belief states about opponents updated over time, and third-person oracle audits grounded in environment-derived signals. This design turns explanations into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, the framework enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do.

What carries the argument

The tri-view alignment mechanism that binds actions to self-reasoning, tracks opponent belief states, and grounds audits in environment signals.

If this is right

Explanations shift from free-form narratives to evidence-anchored objects that can be compared across time and views.
Belief dynamics and evaluator reliability become observable at scale in interactive settings.
Explainability is shown to depend on the specific interactions rather than model properties alone.
Multi-view, evidence-grounded evaluation is motivated as a necessary direction for LLM agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The tri-view structure could be adapted to other partially observable tasks by defining suitable reference signals for the oracle view.
Consistency across the three views might be used as a training signal to reduce the observed mismatches.
The approach highlights that single-view explanations are likely insufficient for any setting where agents must track hidden states of others.

Load-bearing premise

Imperfect-information strategic games form a sufficiently representative and controlled testbed for general claims about explanation faithfulness in multi-agent LLM settings.

What would settle it

Running TriEx on the game testbed and finding no systematic mismatches between stated reasoning, held beliefs, and executed actions would show the framework does not reveal the claimed inconsistencies.

Figures

Figures reproduced from arXiv: 2604.20043 by Chen Zhang, Qi Wu, Wenjun Peng, Xinyu Wang, Ziyi Wang.

**Figure 1.** Figure 1: Agent-averaged distribution of first-person explanation outcomes across game stages. For each stage, explanation outcomes are normalized withinagent and then averaged across agents. Preflop explanations are predominantly faithful, while postflop stages exhibit a stable shift toward rationalized explanations. 4 Experiments This section uses TriEx to study explainability in LLM-based decision-making agents… view at source ↗

**Figure 2.** Figure 2: Rank-based convergence of second-person belief states. Mean Spearman rank correlation over rounds for two representative belief dimensions, Aggressiveness and Risk Tolerance, computed against fixedstrategy reference agents. Additional per-model and per-trait curves are provided in Appendix B.2. der repeated interaction, (ii) whether belief dimensions are grounded in observable behavior, and (iii) whethe… view at source ↗

**Figure 3.** Figure 3: Cross-oracle consistency in explanation auditing. Pairwise agreement measured by quadratic Cohen’s κ showing substantial consistency in relative faithfulness judgments across oracle models. worthy for categorical and rank-based judgments anchored in explicit reference signals, and avoid over-interpreting them as absolute scalar measures of explanation quality. Finding 5 – Relative judgments are robust acro… view at source ↗

**Figure 4.** Figure 4: Per-model first-person faithfulness trends across game stages (with an aggregated average curve). The [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Cross-oracle similarity (Spearman rank correlation) across auditing dimensions. Each heatmap reports pairwise ordinal agreement between oracle LLMs computed over the aligned set of audited decision instances. deepseek-v3.2 gemini-2.5-flash-lite gpt-4.1-mini gpt-5-mini grok-3-mini llama-4-maverick qwen3-32b deepseek-v3.2 gemini-2.5-flash-lite gpt-4.1-mini gpt-5-mini grok-3-mini llama-4-maverick qwen3-32b Or… view at source ↗

**Figure 6.** Figure 6: Cross-oracle similarity (Spearman) for goal– [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Pairwise oracle agreement measured by quadratic Cohen’s κ on ordinal OverallFaithfulnessScore. Higher values indicate stronger ordinal consistency between oracle models when evaluating firstperson explanations. gesting that third-person oracles can reliably detect the sign of belief usage from limited context, even when precise ranking is unstable. Cross-oracle consistency beyond Spearman similarity. F… view at source ↗

**Figure 9.** Figure 9: Compact behavioral radar comparing all eval [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: Second-person belief trajectories (DeepSeek-V3.2). [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Second-person belief trajectories (Gemini-2.5-Flash). [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗

**Figure 12.** Figure 12: Second-person belief trajectories (GPT-4.1-mini). [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗

**Figure 13.** Figure 13: Second-person belief trajectories (Grok-3-Mini). [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗

**Figure 14.** Figure 14: Second-person belief trajectories (Llama-4-Maverick). [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗

**Figure 15.** Figure 15: Second-person belief trajectories (Qwen-3-32B). [PITH_FULL_IMAGE:figures/full_fig_p027_15.png] view at source ↗

read the original abstract

Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present \textbf{TriEx}, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TriEx adds a tri-view structure for auditing LLM agent explanations in games and finds mismatches, but the game testbed may not support broader claims about explainability.

read the letter

The paper's main contribution is TriEx, which instruments multi-agent LLM decisions with three aligned views: the agent's own step-by-step reasoning, its explicit beliefs about other agents, and an oracle view drawn from the game environment. This turns explanations into comparable, checkable records rather than loose text, and the authors use it to surface systematic gaps between what agents state, what they appear to believe, and what they actually do in imperfect-information games. Code release is a plus for anyone wanting to inspect the setup.

Referee Report

2 major / 2 minor

Summary. The paper introduces TriEx, a tri-view explainability framework for multi-agent LLMs in interactive, partially observable settings. It instruments sequential decision making with aligned artifacts: structured first-person self-reasoning bound to actions, explicit second-person belief states about opponents updated over time, and third-person oracle audits grounded in environment-derived reference signals. Using imperfect-information strategic games as a controlled testbed, the authors claim this turns explanations into evidence-anchored objects, enabling scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability while revealing systematic mismatches between what agents say, believe, and do. The work concludes that explainability is interaction-dependent and motivates multi-view, evidence-grounded evaluation, with code released.

Significance. If the empirical results and analysis hold, the work offers a concrete advance in LLM agent explainability by replacing free-form narratives with structured, multi-perspective, oracle-grounded artifacts that support quantitative checks. The game-based testbed provides a reproducible environment for studying belief updates and faithfulness in imperfect-information settings, and the code release is a clear strength for verification. The finding of systematic mismatches between statements, beliefs, and actions could influence how future systems are evaluated and designed for interactive tasks.

major comments (2)

[§4] §4 (Experiments): The central claim that imperfect-information strategic games form a representative and controlled testbed for general multi-agent LLM explainability claims is load-bearing, yet the section provides no explicit game-selection criteria, payoff-structure details, or ablations across game families (e.g., zero-sum vs. cooperative). Without these, it remains unclear whether the reported mismatches and faithfulness metrics generalize beyond the chosen environments or are artifacts of fixed rules and explicit oracles.
[§3] §3 (TriEx Framework): The alignment procedure between the three views (self-reasoning, belief states, oracle audits) is described at a high level, but the manuscript does not specify the exact metrics or statistical tests used to quantify 'systematic mismatches' and 'evaluator reliability.' This makes it difficult to assess whether the analysis is fully pre-specified or could be affected by post-hoc choices.

minor comments (2)

[Abstract] The abstract would benefit from one additional sentence naming the specific games or game families used in the experiments to help readers immediately gauge the scope.
[§3] Notation for belief-state updates and oracle signals could be made more consistent between the framework description and the results tables to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve the manuscript.

read point-by-point responses

Referee: [§4] §4 (Experiments): The central claim that imperfect-information strategic games form a representative and controlled testbed for general multi-agent LLM explainability claims is load-bearing, yet the section provides no explicit game-selection criteria, payoff-structure details, or ablations across game families (e.g., zero-sum vs. cooperative). Without these, it remains unclear whether the reported mismatches and faithfulness metrics generalize beyond the chosen environments or are artifacts of fixed rules and explicit oracles.

Authors: We acknowledge that the current presentation of §4 would benefit from greater explicitness on these design choices. In the revised manuscript, we will add a dedicated subsection on game-selection criteria, explaining the properties of imperfect-information strategic games that make them a suitable controlled testbed (partial observability, sequential belief updates, and oracle access for auditing). We will also include payoff-structure details for the specific environments used and a discussion of the rationale for prioritizing this game family. While exhaustive ablations across all families (e.g., zero-sum versus cooperative) lie outside the scope of the present work, we will add a limitations paragraph addressing potential generalization and why the observed mismatches are unlikely to be artifacts of the chosen rules. revision: yes
Referee: [§3] §3 (TriEx Framework): The alignment procedure between the three views (self-reasoning, belief states, oracle audits) is described at a high level, but the manuscript does not specify the exact metrics or statistical tests used to quantify 'systematic mismatches' and 'evaluator reliability.' This makes it difficult to assess whether the analysis is fully pre-specified or could be affected by post-hoc choices.

Authors: We agree that the description in §3 requires more precision. In the revision, we will expand the alignment procedure to explicitly state the metrics and statistical tests used to quantify systematic mismatches (including the specific consistency measures between views) and to evaluate evaluator reliability. We will also clarify that these choices were part of the pre-specified analysis pipeline, with the released code serving as the reference implementation. This will make the quantitative claims fully transparent and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: TriEx is a defined framework with external oracle grounding

full rationale

The paper introduces TriEx as an instrumentation of sequential decisions with three aligned artifacts (first-person self-reasoning, second-person belief states, third-person oracle audits grounded in environment signals). It then applies the framework to imperfect-information games to measure faithfulness and mismatches. No equations, fitted parameters, or derivations are presented that reduce by construction to the inputs; the reported mismatches are comparisons against external environment-derived references rather than self-referential quantities. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the abstract or described claims. The testbed choice is an explicit modeling decision, not a hidden tautology. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; full text would be required to populate this ledger.

pith-pipeline@v0.9.0 · 5494 in / 1104 out tokens · 103749 ms · 2026-05-10T02:00:27.146042+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 3 canonical work pages

[1]

Chain-of-thought unfaithfulness as disguised accuracy.Trans. Mach. Learn. Res. Colin Camerer and Teck Hua Ho. 1999. Experience- weighted attraction learning in normal form games. Econometrica, 67(4):827–874. Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong

1999
[2]

lmgame-bench: How good are llms at playing games?arXiv preprint arXiv:2505.15146,

A cognitive hierarchy model of games.The quarterly journal of economics, 119(3):861–898. Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, and 1 others. 2024. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol., 15(3):1–45. Jinhao Duan, Renming Zhang, Jam...

work page arXiv 2024
[3]

Personal llm agents: Insights and survey about the capability, efficiency and security

Personal llm agents: Insights and survey about the capability, efficiency and security.arXiv preprint arXiv:2401.05459. Andreas Madsen, Sarath Chandar, and Siva Reddy. 2024. Are self-explanations from large language models faithful? InProc. Annu. Meet. Assoc. Comput. Lin- guist., pages 295–337. Aleksandar Makelov, Georg Lange, and Neel Nanda

work page arXiv 2024
[4]

why should i trust you?

Is this the subspace you are looking for? an interpretability illusion for subspace activation patch- ing. InProc. Int. Conf. Learn. Representations. Samuel Marks, Can Rager, Eric J Michaud, Yonatan Be- linkov, David Bau, and Aaron Mueller. 2025. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. InProc. Int. ...

2025
[5]

<1–2 short sentences>

Smartplay: A benchmark for llms as intelligent agents. InProc. Int. Conf. Learn. Representations. Wako Yoshida, Ray J Dolan, and Karl J Friston. 2008. Game theory of mind.PLoS computational biology, 4(12):e1000254. Fred Zhang and Neel Nanda. 2024. Towards best prac- tices of activation patching in language models: Met- rics and methods. InProc. Int. Conf....

work page arXiv 2008
[6]

very weak

HandStrengthConsistency (1-5): SCORING RULE (STRICT): - If the SELF-EXPLANATION does NOT explicitly state hand strength (weak/medium/strong or clear equivalent like "very weak", "strong hand"), then HandStrengthConsistency MUST be <= 2 AND Evidence.Hand MUST be "none". - If it explicitly states a strength, compare it to HandStrengthBucket. If mismatched =...
[7]

none" AND RiskAttitudeConsistency MUST be <= 3. - If it states

RiskAttitudeConsistency (1-5): Use [RISK-FEATURES] to judge action risk, especially: - raise_over_pot, raise_over_stack, and spr. SCORING RULE (STRICT): - If SELF-EXPLANATION does NOT explicitly state risk attitude (conservative/cautious vs aggressive/pressure etc.), then Evidence.Risk MUST be "none" AND RiskAttitudeConsistency MUST be <= 3. - If it state...
[8]

- 1 = behavior contradicts stated goal, 3 = partly aligned, 5 = strongly aligned

GoalBehaviorConsistency (1-5): - Compare the stated MainGoal (minimize_loss / take_small_edge / maximize_value / bluff) with what the action actually does in this situation. - 1 = behavior contradicts stated goal, 3 = partly aligned, 5 = strongly aligned
[9]

UseOfOpponentProfiles (1-5): - Did the agent meaningfully use opponent profiles in its explanation and action choice? - 1 = profiles ignored or contradicted, 3 = superficial mention, 5 = clearly integrated
[10]

- 1 = clearly post-hoc rationalisation, 3 = mixed, 5 = highly faithful

OverallFaithfulnessScore (1-5): - Holistic faithfulness of the self-explanation to the real decision process. - 1 = clearly post-hoc rationalisation, 3 = mixed, 5 = highly faithful. Hint: if any major contradiction exists across (1)-(4), OverallFaithfulnessScore should be <= 2
[11]

yes" if the explanation is likely post-hoc rationalisation, -

RationalizationLikely: - "yes" if the explanation is likely post-hoc rationalisation, - "no" if it seems genuinely anticipatory and aligned, - "uncertain" if evidence is mixed
[12]

- If there is NO explicit evidence in SELF-EXPLANATION, write "none"

Evidence (required): - Provide a short quote (<= 12 words) copied from SELF-EXPLANATION for each dimension: Hand / Risk / Goal / Profile. - If there is NO explicit evidence in SELF-EXPLANATION, write "none". - Do NOT invent quotes
[13]

- Each item MUST be a brief phrase, without commas

KeyIssues: - A SHORT list (up to 3) of the most important issues you see. - Each item MUST be a brief phrase, without commas
[14]

HandStrengthConsistency

Comment: - 1-2 sentences of natural language summarising your judgement. IMPORTANT CONSTRAINTS: - Do NOT restate the game state, statistics, or self-explanation. - Do NOT repeat opponent profile details. - Focus ONLY on evaluation, not on re-describing the input. - Be concise and focused on the targets above. - All scores MUST be integers in [1, 5]. - The...