Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization

Ethan Fang; Junyi Liao; Vahid Tarokh; Zhuoran Yang; Zihan Zhu

arxiv: 2601.12707 · v2 · pith:WH2LSVAHnew · submitted 2026-01-19 · 💻 cs.LG · stat.ML

Junyi Liao, Zihan Zhu, Ethan Fang, Zhuoran Yang, Vahid Tarokh

Pith reviewed 2026-05-21 14:50 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords inverse game theory · quantal response equilibrium · entropy regularization · reward recovery · zero-sum games · Markov games · inverse reinforcement learning

The pith

Reward functions driving agent play in zero-sum games can be uniquely recovered from observed strategies when behaviors follow quantal response equilibrium with linear rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the inverse problem of recovering unknown rewards that explain observed actions in competitive two-player games, both in simple matrix settings and in dynamic Markov games. It shows that under entropy regularization the quantal response equilibrium makes these rewards identifiable when they are linear in a known set of features. A new algorithm then estimates the rewards from limited action data, with proofs of consistency and sample efficiency. This matters because many real decision problems in economics, AI, and security are zero-sum competitions where we cannot directly observe payoffs but can watch play. If the identifiability result holds, downstream tasks such as predicting future behavior or designing incentives become better grounded.

Core claim

We establish that the reward functions of two-player zero-sum matrix games and Markov games are identifiable from observed player strategies and actions when the data are generated by the quantal response equilibrium under linear reward assumptions. Building on this, we propose a unified algorithm that recovers the rewards in both static and dynamic settings and admits strong theoretical guarantees on reliability and sample efficiency.
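
One way to see why linearity plus QRE can pin down the reward, sketched for the matrix-game case only. The notation ($Q$, $\phi$, $\theta$, $\tau$, $a_0$, $b_0$) is assumed here for illustration and may not match the paper's; in particular, the entropy weight $\tau$ is treated as known, and the row player is taken to maximize.

```latex
% Assumed setup: Q(a,b) = \phi(a,b)^\top \theta with known features \phi(a,b) \in \mathbb{R}^d,
% row player maximizes, column player minimizes, entropy weight \tau > 0.
\begin{align*}
\mu^*(a) &\propto \exp\!\Big(\tfrac{1}{\tau}\sum_{b}\nu^*(b)\,\phi(a,b)^\top\theta\Big),
&
\nu^*(b) &\propto \exp\!\Big(-\tfrac{1}{\tau}\sum_{a}\mu^*(a)\,\phi(a,b)^\top\theta\Big),
\\
\tau\log\frac{\mu^*(a)}{\mu^*(a_0)} &= \Big(\sum_{b}\nu^*(b)\,\big[\phi(a,b)-\phi(a_0,b)\big]\Big)^{\!\top}\theta,
&
-\tau\log\frac{\nu^*(b)}{\nu^*(b_0)} &= \Big(\sum_{a}\mu^*(a)\,\big[\phi(a,b)-\phi(a,b_0)\big]\Big)^{\!\top}\theta.
\end{align*}
```

Stacking the second row of equations over all $a \neq a_0$ and $b \neq b_0$ gives a linear system in $\theta$ whose coefficients depend only on the observed strategies and the known features. Under these assumptions, $\theta$ is uniquely determined exactly when the stacked coefficient matrix has rank $d$, which appears to be the role of the rank condition the paper states, though the paper's own construction may differ.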

What carries the argument

The quantal response equilibrium (QRE) under entropy regularization, which maps linear reward functions to unique equilibrium strategy distributions and thereby removes the usual non-uniqueness of inverse game problems.
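
To make the forward map from rewards to equilibrium strategies concrete, here is a minimal NumPy sketch that approximates the QRE of a small entropy-regularized zero-sum matrix game. Everything named here is an assumption of the sketch: `solve_qre`, `qre_residual`, the symbol `tau` for the entropy weight, the convention that the row player maximizes, and the solver itself, which is a predictive-update (extragradient-style) multiplicative-weights iteration chosen for simplicity rather than anything the paper is known to use.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log of softmax(z)."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def solve_qre(Q, tau=1.0, iters=3000):
    """Approximate the QRE of max_mu min_nu mu^T Q nu + tau*H(mu) - tau*H(nu)."""
    Q = np.asarray(Q, dtype=float)
    m, n = Q.shape
    # Conservative step size for this scheme (an assumption of the sketch).
    step = 1.0 / (2.0 * tau + 4.0 * np.abs(Q).max() + 1e-12)
    log_mu = np.full(m, -np.log(m))  # start from uniform strategies
    log_nu = np.full(n, -np.log(n))
    for _ in range(iters):
        mu, nu = np.exp(log_mu), np.exp(log_nu)
        # Prediction step: respond to the opponent's current strategy.
        mu_bar = np.exp(log_softmax((1 - step * tau) * log_mu + step * (Q @ nu)))
        nu_bar = np.exp(log_softmax((1 - step * tau) * log_nu - step * (Q.T @ mu)))
        # Correction step: respond to the opponent's predicted strategy.
        log_mu = log_softmax((1 - step * tau) * log_mu + step * (Q @ nu_bar))
        log_nu = log_softmax((1 - step * tau) * log_nu - step * (Q.T @ mu_bar))
    return np.exp(log_mu), np.exp(log_nu)

def qre_residual(Q, mu, nu, tau):
    """Largest violation of the softmax fixed-point conditions."""
    r_mu = np.abs(mu - np.exp(log_softmax(Q @ nu / tau))).max()
    r_nu = np.abs(nu - np.exp(log_softmax(-Q.T @ mu / tau))).max()
    return max(r_mu, r_nu)

# Hypothetical 2x2 payoff matrix for the row (maximizing) player.
Q = np.array([[2.0, 1.0], [0.0, -1.0]])
mu, nu = solve_qre(Q, tau=1.0)
print("mu =", mu, "nu =", nu, "residual =", qre_residual(Q, mu, nu, 1.0))
```

The residual check is the useful part: a pair of strategies is a QRE in this sketch's convention exactly when each is the softmax response to the other, so a residual near zero indicates the iteration has reached the unique fixed point.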

If this is right

The same linear-QRE framework recovers rewards in both one-shot matrix games and multi-stage Markov games.
The estimator remains consistent and sample-efficient even with partial observation of actions.
The method can be instantiated with maximum-likelihood estimation or other loss functions while retaining the identifiability guarantee.
Recovered rewards can be used to predict future play or to design interventions in competitive environments.
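
The recovery step in the static case can be sketched as follows. This is a plug-in least-squares estimator built from empirical action frequencies, not the paper's algorithm or its MLE variant; the features `phi`, the ground truth `theta_true`, the entropy weight `tau`, and the helper names are all hypothetical, and the embedded QRE solver is an illustrative choice used only to generate synthetic data.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def solve_qre(Q, tau, iters=3000):
    """Illustrative QRE solver (predictive-update multiplicative weights)."""
    step = 1.0 / (2.0 * tau + 4.0 * np.abs(Q).max() + 1e-12)
    lm = np.full(Q.shape[0], -np.log(Q.shape[0]))
    ln = np.full(Q.shape[1], -np.log(Q.shape[1]))
    for _ in range(iters):
        mu, nu = np.exp(lm), np.exp(ln)
        mu_bar = np.exp(log_softmax((1 - step * tau) * lm + step * (Q @ nu)))
        nu_bar = np.exp(log_softmax((1 - step * tau) * ln - step * (Q.T @ mu)))
        lm = log_softmax((1 - step * tau) * lm + step * (Q @ nu_bar))
        ln = log_softmax((1 - step * tau) * ln - step * (Q.T @ mu_bar))
    return np.exp(lm), np.exp(ln)

def estimate_theta(phi, mu, nu, tau):
    """Least-squares solve of the stacked log-ratio equations X theta = y."""
    # Row player: tau*log(mu[a]/mu[0]) = (sum_b nu[b]*(phi[a,b]-phi[0,b])) . theta
    A = np.einsum("b,abd->ad", nu, phi[1:] - phi[:1])
    ya = tau * (np.log(mu[1:]) - np.log(mu[0]))
    # Column player minimizes, so its equations carry the opposite sign.
    B = np.einsum("a,abd->bd", mu, phi[:, 1:] - phi[:, :1])
    yb = -tau * (np.log(nu[1:]) - np.log(nu[0]))
    X, y = np.vstack([A, B]), np.concatenate([ya, yb])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta, np.linalg.matrix_rank(X)

rng = np.random.default_rng(0)
m, n, d, tau = 4, 4, 3, 1.0
phi = rng.normal(size=(m, n, d))          # hypothetical known features
theta_true = np.array([1.0, -0.5, 0.5])   # hypothetical ground-truth weights
mu, nu = solve_qre(phi @ theta_true, tau)

for N in (1_000, 100_000):
    # Empirical action frequencies, lightly smoothed to avoid log(0).
    mu_hat = (rng.multinomial(N, mu) + 0.5) / (N + 0.5 * m)
    nu_hat = (rng.multinomial(N, nu) + 0.5) / (N + 0.5 * n)
    theta_hat, rank = estimate_theta(phi, mu_hat, nu_hat, tau)
    print(N, rank, np.linalg.norm(theta_hat - theta_true))
```

With exact strategies and a full-rank system, the least-squares solve returns the true weights up to numerical error. With sampled actions, the estimation error should shrink as `N` grows, provided every action is played often enough for its log-frequency to be stable.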

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be tested by fitting rewards to large-scale game logs from chess, poker, or online auctions and then checking out-of-sample prediction accuracy.
If the linearity assumption is relaxed to a known basis expansion or kernel, the same identifiability argument may extend to richer reward classes.
The framework suggests a route to reward inference in non-zero-sum or multi-player settings once an appropriate equilibrium notion replaces QRE.

Load-bearing premise

Observed player strategies and actions are generated exactly according to the quantal response equilibrium with reward functions that are linear in a known feature basis.

What would settle it

Generate synthetic data from a zero-sum game whose true rewards are nonlinear in the chosen features, run the algorithm, and check whether it returns a reward vector that is statistically far from the true one or fails to predict held-out actions.
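
A small-scale version of this check can be sketched with exact equilibrium strategies in place of sampled actions. The quadratic distortion `Q_nl`, the features `phi`, the weights `theta_true`, and the least-squares fit are all assumptions made for illustration; the fit stands in for the paper's estimator, which is not reproduced here.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def solve_qre(Q, tau, iters=5000):
    """Illustrative QRE solver (predictive-update multiplicative weights)."""
    step = 1.0 / (2.0 * tau + 4.0 * np.abs(Q).max() + 1e-12)
    lm = np.full(Q.shape[0], -np.log(Q.shape[0]))
    ln = np.full(Q.shape[1], -np.log(Q.shape[1]))
    for _ in range(iters):
        mu, nu = np.exp(lm), np.exp(ln)
        mu_bar = np.exp(log_softmax((1 - step * tau) * lm + step * (Q @ nu)))
        nu_bar = np.exp(log_softmax((1 - step * tau) * ln - step * (Q.T @ mu)))
        lm = log_softmax((1 - step * tau) * lm + step * (Q @ nu_bar))
        ln = log_softmax((1 - step * tau) * ln - step * (Q.T @ mu_bar))
    return np.exp(lm), np.exp(ln)

def fit_linear_reward(phi, mu, nu, tau):
    """Fit theta under the linear-QRE model; return theta and the residual norm."""
    A = np.einsum("b,abd->ad", nu, phi[1:] - phi[:1])
    B = np.einsum("a,abd->bd", mu, phi[:, 1:] - phi[:, :1])
    X = np.vstack([A, B])
    y = tau * np.concatenate([np.log(mu[1:] / mu[0]), -np.log(nu[1:] / nu[0])])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta, float(np.linalg.norm(X @ theta - y))

def check(Q_true, phi, tau):
    """Generate QRE play from Q_true, fit a linear reward, and score the fit."""
    mu, nu = solve_qre(Q_true, tau)
    theta, resid = fit_linear_reward(phi, mu, nu, tau)
    # Predict the row player's strategy from the fitted reward and compare.
    mu_fit, _ = solve_qre(phi @ theta, tau)
    tv = 0.5 * float(np.abs(mu_fit - mu).sum())
    return theta, resid, tv

rng = np.random.default_rng(1)
phi = rng.normal(size=(4, 4, 3))          # hypothetical features
theta_true = np.array([1.0, -0.5, 0.5])   # hypothetical weights
Q_lin = phi @ theta_true                  # well-specified reward
Q_nl = Q_lin + 0.5 * Q_lin ** 2           # reward nonlinear in the features

_, resid_lin, tv_lin = check(Q_lin, phi, 1.0)
theta_nl, resid_nl, tv_nl = check(Q_nl, phi, 1.0)
print("linear:    residual", resid_lin, "TV", tv_lin)
print("nonlinear: residual", resid_nl, "TV", tv_nl)
```

The residual of the stacked log-ratio equations acts as a specification diagnostic in this sketch: it is numerically zero when the reward really is linear in the features and strictly positive when it is not. The total-variation gap between predicted and observed strategies plays the role of the held-out prediction check.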

Original abstract

Estimating the unknown reward functions driving agents' behaviors is of central interest in inverse reinforcement learning and game theory. To tackle this problem, we develop a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization, where we aim to reconstruct the underlying reward functions given observed players' strategies and actions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish the reward function's identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building upon this theoretical foundation, we propose a novel algorithm to learn reward functions from observed actions. Our algorithm works in both static and dynamic settings and is adaptable to incorporate different methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm. Further, we conduct extensive numerical studies to demonstrate the practical effectiveness of the proposed framework, offering new insights into decision-making in competitive environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a unified framework for recovering unknown reward functions in two-player zero-sum matrix games and Markov games under entropy regularization. It establishes identifiability of the reward function via the quantal response equilibrium (QRE) under the assumption that rewards are linear in a known feature basis, proposes an MLE-adaptable algorithm for learning from observed strategies and actions in both static and dynamic settings, provides theoretical guarantees on reliability and sample efficiency, and validates the approach through numerical experiments.

Significance. If the identifiability result and sample-efficiency guarantees hold under the stated assumptions, the work would provide a principled approach to inverse game theory problems, addressing ambiguity and limited coverage in competitive settings and offering a bridge between QRE-based modeling and practical reward inference in RL and game-theoretic applications.

major comments (2)

§3 (Theoretical Foundation): The identifiability claim (e.g., Theorem 1) is derived under the exact-QRE generation assumption with linear rewards in a known basis; the manuscript provides no robustness analysis or error bounds for misspecification of the feature basis or deviations from QRE, which is load-bearing for the practical-effectiveness claim in the abstract and numerical studies section.
§4 (Algorithm and Guarantees): The sample-efficiency bound for the MLE-based procedure assumes perfect coverage and exact equilibrium play; without explicit dependence on the entropy coefficient or analysis of finite-sample deviation from QRE, the guarantee does not directly support the 'strong theoretical guarantees' asserted for realistic observational data.

minor comments (2)

Notation for the entropy regularization coefficient is introduced without a dedicated symbol table; consistent use across equations would improve readability.
The numerical studies section would benefit from an explicit ablation on feature-basis completeness to illustrate sensitivity to the linear assumption.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate planned revisions to clarify assumptions and strengthen the presentation of limitations.

Point-by-point responses

Referee: §3 (Theoretical Foundation): The identifiability claim (e.g., Theorem 1) is derived under the exact-QRE generation assumption with linear rewards in a known basis; the manuscript provides no robustness analysis or error bounds for misspecification of the feature basis or deviations from QRE, which is load-bearing for the practical-effectiveness claim in the abstract and numerical studies section.

Authors: We agree that Theorem 1 establishes identifiability strictly under exact QRE and a correctly specified linear feature basis. This is the standard modeling assumption in the inverse game theory literature for obtaining clean identifiability results. The numerical experiments are conducted under the same generative model. To address the concern, we will add a dedicated paragraph in the discussion section on sensitivity to feature misspecification and small QRE deviations, together with additional simulation results that perturb the basis or add noise to observed strategies. These changes will be incorporated in the revision.

revision: partial

Referee: §4 (Algorithm and Guarantees): The sample-efficiency bound for the MLE-based procedure assumes perfect coverage and exact equilibrium play; without explicit dependence on the entropy coefficient or analysis of finite-sample deviation from QRE, the guarantee does not directly support the 'strong theoretical guarantees' asserted for realistic observational data.

Authors: The sample-complexity bounds in Section 4 are derived under the stated assumptions of perfect coverage and exact QRE, which are made explicit in the theorem statements. These assumptions enable the current analysis; the entropy coefficient appears in the concentration terms but is not isolated as a separate parameter in the final rate. We will revise the text to (i) explicitly restate the assumptions when citing the guarantees, (ii) highlight the role of the entropy parameter in the bounds, and (iii) add a remark noting that extensions to approximate QRE or imperfect coverage constitute future work. This will temper the language in the abstract and introduction accordingly.

revision: partial

Circularity Check

0 steps flagged

No significant circularity: identifiability derived from QRE properties and linear parameterization

full rationale

The paper's central theoretical step establishes reward identifiability from the quantal response equilibrium (QRE) under the assumption that observed strategies are generated exactly according to entropy-regularized QRE with rewards linear in a known feature basis. This is a conditional result resting on explicit model assumptions rather than any reduction of the claimed identifiability to a fitted quantity defined by the same data or to a self-citation chain. The subsequent MLE-adaptable algorithm and sample-efficiency guarantees are presented as consequences of this foundation, with no indication that the derivation renames a known result, smuggles an ansatz via citation, or imports uniqueness from prior author work as an unverified external fact. The framework remains self-contained against external benchmarks once the QRE and linearity assumptions are granted.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on domain assumptions standard in game theory and inverse RL; linearity is introduced to resolve non-uniqueness and is not independently verified in the abstract.

free parameters (1)

entropy regularization coefficient
Controls the strength of entropy regularization in the equilibrium and learning objective.
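
A toy illustration of what this coefficient does to equilibrium play. The 2x2 game below is hypothetical and was chosen because its QRE has a closed form, so no solver is needed; the symbol `tau` and the convention that it multiplies the entropy term (row player maximizing) are assumptions of the sketch, not notation taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def entropy(p):
    """Shannon entropy in nats."""
    return float(-(p * np.log(p)).sum())

def qre_dominance_game(tau):
    """Closed-form QRE for Q = [[2, 1], [0, -1]] (row maximizes, column minimizes).

    Row 0 pays the row player exactly 2 more than row 1 against any column mix,
    and column 1 costs the column player exactly 1 less than column 0 against
    any row mix, so each softmax response is independent of the opponent.
    """
    mu = np.array([sigmoid(2.0 / tau), sigmoid(-2.0 / tau)])
    nu = np.array([sigmoid(-1.0 / tau), sigmoid(1.0 / tau)])
    return mu, nu

taus = [0.25, 1.0, 4.0, 16.0]
entropies = []
for tau in taus:
    mu, nu = qre_dominance_game(tau)
    entropies.append(entropy(mu))
    print(f"tau={tau:5.2f}  mu={mu.round(4)}  nu={nu.round(4)}  H(mu)={entropies[-1]:.4f}")
```

In this game, a small `tau` pushes both players toward the pure saddle point (row 0, column 1), while a large `tau` pulls both strategies toward uniform. Because the strategies depend on the rewards only through the ratio of reward differences to `tau`, the reward scale and the coefficient cannot be separated from play alone in this sketch, which is why `tau` is treated as known.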

axioms (2)

domain assumption: Reward functions are linear in a known feature basis
Invoked to establish identifiability from QRE (abstract section on theoretical foundation).
domain assumption: Observed actions are generated by quantal response equilibrium
Central modeling choice for the inverse problem setup.

pith-pipeline@v0.9.0 · 5723 in / 1249 out tokens · 47858 ms · 2026-05-21T14:50:07.912564+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We establish the reward function's identifiability using the quantal response equilibrium (QRE) under linear assumptions... rank condition that rank[A(ν*) B(μ*)] = d"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test
physics.soc-ph · 2026-05 · unverdicted · novelty 5.0

Frontier LLMs in simulated Greenland sovereignty games escalate more under coercion framing, differ by origin when playing the US role, and achieve peaceful US acquisition in only 1.9% of clean games.