Persistent-Transient Policy Evaluation for Markov Chains via Minimal Peripheral Quotients

Vaneet Aggarwal; Yang Xu

arXiv: 2602.00474 · v2 · submitted 2026-01-31 · stat.ML · cs.LG · cs.NA · math.NA

Pith reviewed 2026-05-16 09:20 UTC · model grok-4.3

keywords: Markov chains · policy evaluation · peripheral invariant subspace · quotient methods · persistent transient decomposition · gain bias analysis · reducible chains · periodic chains

The pith

Quotienting Markov chains by their real peripheral invariant subspace yields a unique reward decomposition that cleanly separates persistent regime profiles from gauge-fixed transient components.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Classical gain-bias methods for policy evaluation in finite Markov chains mix persistent phase-dependent behavior into the bias term together with genuine transients, especially when chains are reducible or periodic. The paper identifies the real peripheral invariant subspace K(P) of the transition matrix as the source of this mixing. Quotienting by K(P) produces the minimal exact quotient that eliminates all non-decaying modes and renders the remaining dynamics strictly stable. With a gauge projection Π whose kernel is exactly K(P), the reward then decomposes uniquely as r = g_Π^* + (I-P)v_Π^*, where g_Π^* holds every persistent mode and v_Π^* holds only decaying effects. A reader would care because the split restores exact reconstruction of finite-horizon returns and statewise averages while supporting stable estimation from samples.

Core claim

After choosing a gauge projection Π with kernel K(P), the reward admits a unique decomposition r = g_Π^* + (I-P)v_Π^*, where g_Π^* is a persistent regime profile and v_Π^* is a gauge-fixed transient component. Quotienting by K(P) is the minimal exact quotient that removes all non-decaying modes. An exact comparison with classical normalized gain and bias shows that the new pair reallocates the same information so that all persistent modes appear in g_Π^* while v_Π^* remains transient. This decomposition reconstructs finite-horizon returns, recovers statewise average reward, admits a transient-cost interpretation, and yields a stable estimator under a generative model.

What carries the argument

The minimal peripheral quotient obtained by projecting out the real peripheral invariant subspace K(P) of the transition matrix P, via a gauge projection Π whose kernel is exactly K(P).
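To make the construction concrete, here is a minimal sketch on a hypothetical three-state chain (a deterministic 2-cycle plus one transient state). The chain, the reward vector, and all names (`ones`, `u`, `e3`, `a`, `b`, `c`) are illustrative assumptions, not taken from the paper. It also assumes the gauge projection's image is the P-invariant stable complement, so that v lies in the stable eigenspace; the eigenvectors were worked out by hand for this particular chain.

```python
from fractions import Fraction as F

# Hypothetical chain: states 0 and 1 form a deterministic 2-cycle; state 2 is
# transient (stays with probability 1/2, otherwise jumps to state 0).
P = [[F(0), F(1), F(0)],
     [F(1), F(0), F(0)],
     [F(1, 2), F(0), F(1, 2)]]
r = [F(3), F(1), F(5)]  # illustrative reward, not from the paper

def matvec(M, x):
    """Matrix-vector product with exact rational arithmetic."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

# Right eigenvectors of P, worked out by hand for this particular chain.
ones = [F(1), F(1), F(1)]        # eigenvalue +1 (peripheral)
u = [F(1), F(-1), F(-1, 3)]      # eigenvalue -1 (peripheral)
e3 = [F(0), F(0), F(1)]          # eigenvalue 1/2 (stable)

# Coordinates of r in that eigenbasis: r = a*ones + b*u + c*e3.
a = (r[0] + r[1]) / 2
b = (r[0] - r[1]) / 2
c = r[2] - a + b / 3

# Persistent part lives in K(P) = span{ones, u}; the transient part solves
# (I - P) v = c*e3 inside the stable eigenspace, where P multiplies by 1/2.
g = [a * o + b * ui for o, ui in zip(ones, u)]
v = [2 * c * e for e in e3]

# Check r = g + (I - P) v.
Pv = matvec(P, v)
recon = [gi + vi - pvi for gi, vi, pvi in zip(g, v, Pv)]

# Iterate P on v: the only nonzero entry halves at every step.
v_after_10 = v
for _ in range(10):
    v_after_10 = matvec(P, v_after_10)

print(g, v, recon == r, v_after_10)
```

In this sketch the phase-dependent part of the reward on the 2-cycle ends up in g rather than in v, which is the separation the summary describes. Exact rationals are used so the identity can be checked with equality rather than a tolerance.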

If this is right

Finite-horizon returns can be reconstructed exactly from the separated components.
Statewise average reward is recovered without absorbing persistent phases into bias terms.
The transient component admits a direct cost interpretation that decays under the quotient dynamics.
Estimation remains stable when data are generated by sampling the chain.
The classical gain-bias pair is recovered as a reallocation of the same information with persistent modes moved into the new g term.
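The first two consequences can be read off from a telescoping sum. The following is a sketch of that calculation, not a quotation from the paper; it uses only the identity r = g_Π^* + (I-P)v_Π^*. Here P^T denotes the T-th power of P (not a transpose), and P^* is an assumed symbol for the Cesàro limit of P.

```latex
\[
  \sum_{t=0}^{T-1} P^t r
  = \sum_{t=0}^{T-1} P^t \bigl(g_\Pi^\star + (I-P)\,v_\Pi^\star\bigr)
  = \sum_{t=0}^{T-1} P^t g_\Pi^\star \;+\; v_\Pi^\star - P^T v_\Pi^\star .
\]
% Dividing by T: the correction term is bounded because \|P^T v\|_\infty \le \|v\|_\infty
% for a stochastic matrix, so it vanishes in the limit.
\[
  \frac{1}{T}\sum_{t=0}^{T-1} P^t r
  = \frac{1}{T}\sum_{t=0}^{T-1} P^t g_\Pi^\star
    + \frac{v_\Pi^\star - P^T v_\Pi^\star}{T}
  \;\xrightarrow[T\to\infty]{}\; P^\ast g_\Pi^\star .
\]
```

The telescoping step holds for any pair satisfying the identity; what transience of v_Π^* adds is that the correction v_Π^* - P^T v_Π^* converges to v_Π^* itself as T grows.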

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation may allow reinforcement-learning algorithms to maintain separate models for long-term regime identification and short-term transient planning.
In environments with known periodicity, the persistent profile could be used to schedule phase-aware actions without averaging over cycles.
The same quotient construction might extend to approximate settings by estimating the peripheral subspace from sampled trajectories.

Load-bearing premise

There exists a gauge projection whose kernel is precisely the real peripheral invariant subspace K(P) and whose quotient removes every non-decaying mode while still allowing exact reconstruction of returns from the resulting components.

What would settle it

Construct a small periodic or reducible chain, compute the classical gain-bias pair and the proposed g_Π^* plus (I-P)v_Π^* pair, and check whether only the new pair produces exact equality between the sum of discounted rewards over a long finite horizon and the value obtained by solving the original Bellman equation.
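A minimal sketch of such a check on a two-state deterministic cycle follows. It makes several assumptions: the reward vector is illustrative, the return is the undiscounted finite-horizon sum, and the helper names (`apply_n`, `horizon_return`, `reconstruct`) are hypothetical. The classical bias uses the usual normalization that its stationary average is zero.

```python
from fractions import Fraction as F

P = [[F(0), F(1)],
     [F(1), F(0)]]          # deterministic 2-cycle, period 2
r = [F(4), F(0)]            # illustrative reward

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def apply_n(M, x, n):
    """Return M^n x by repeated multiplication."""
    for _ in range(n):
        x = matvec(M, x)
    return x

def horizon_return(M, x, T):
    """Undiscounted T-step return sum_{t=0}^{T-1} M^t x."""
    total = [F(0)] * len(x)
    for _ in range(T):
        total = [s + xi for s, xi in zip(total, x)]
        x = matvec(M, x)
    return total

def reconstruct(M, g, v, T):
    """Telescoped form: sum_{t<T} M^t g + v - M^T v."""
    tail = apply_n(M, v, T)
    return [s + vi - ti for s, vi, ti in zip(horizon_return(M, g, T), v, tail)]

# Classical pair: gain is the Cesaro average, bias h solves
# (I - P) h = r - gain with the normalization h[0] + h[1] = 0.
mean = (r[0] + r[1]) / 2
gain = [mean, mean]
h = [(r[0] - r[1]) / 4, (r[1] - r[0]) / 4]

# Peripheral pair: eigenvalues +1 and -1 are both peripheral, so K(P) is all
# of R^2, the whole reward is persistent, and the transient part vanishes.
g_new, v_new = list(r), [F(0), F(0)]

T = 7
direct = horizon_return(P, r, T)
classical = reconstruct(P, gain, h, T)
peripheral = reconstruct(P, g_new, v_new, T)
bias_after_50 = apply_n(P, h, 50)   # equals h again: the bias never decays
print(direct, classical == direct, peripheral == direct, bias_after_50 == h)
```

In this sketch both pairs reproduce the finite-horizon return, because the telescoping identity holds for any pair satisfying r = g + (I-P)v. What differs is the second term: the classical bias oscillates with period 2 forever, while the peripheral transient component is identically zero.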

Original abstract

We study fixed-policy evaluation for finite Markov chains that may be reducible and periodic. Classical evaluation methods with gain and bias decomposition are not always diagnostic: the gain records only invariant Ces\`aro averages, while persistent phase-dependent behavior is absorbed into the bias together with genuinely transient effects. We identify the real peripheral invariant subspace $\mathcal{K}(P)$ of the transition matrix $P$ as the source of this ambiguity. Quotienting by $\mathcal{K}(P)$ is the minimal exact quotient that removes all non-decaying modes and makes the remaining dynamics strictly stable. After choosing a gauge projection $\Pi$ with kernel $\mathcal{K}(P)$, the reward admits a unique decomposition $r = g_\Pi^\star + (I-P)v_\Pi^\star$, where $g_\Pi^\star$ is a persistent regime profile and $v_\Pi^\star$ is a gauge-fixed transient component. An exact comparison with classical normalized gain and bias shows that the new pair reallocates the same information so that all persistent modes are represented in $g_\Pi^\star$ and $v_\Pi^\star$ is transient. This decomposition reconstructs finite-horizon returns, recovers statewise average reward, admits a transient-cost interpretation, and yields a stable estimator under a generative model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a useful new decomposition for handling persistent and transient effects in reducible Markov chains by quotienting out the peripheral subspace, but the transience of the second term requires the gauge projection to land on the stable invariant complement rather than an arbitrary one.

The main takeaway is that this work supplies a cleaner split between persistent regime profiles and transient components for policy evaluation on finite Markov chains that may be reducible or periodic. Classical gain-bias methods fold phase-dependent persistent behavior into the bias term, which hurts both interpretation and finite-horizon reconstruction. The authors isolate the real peripheral invariant subspace K(P) as the source of the ambiguity and show that quotienting by it yields strictly stable remaining dynamics. After fixing a gauge projection Π with kernel K(P), they obtain the decomposition r = g_Π^* + (I-P)v_Π^* that reallocates the same total information so g_Π^* holds all non-decaying modes and v_Π^* decays. The claims about exact return reconstruction, statewise average reward recovery, and a stable generative-model estimator follow from this split. That is the concrete advance over prior decompositions.

The construction is defined directly from the spectral properties of P, so it avoids circularity or fitted parameters. The stress-test concern is on point: an arbitrary complement S to K(P) can embed persistent modes into the lift v_Π^*, so ||P^n v_Π^*|| need not go to zero. The abstract phrasing does not explicitly restrict the image of Π to the canonical stable subspace, which is the only choice that guarantees transience. If the full paper pins this down with the right invariant complement and supplies the uniqueness and reconstruction proofs, the central claim stands; otherwise the transient-cost interpretation weakens.

The paper is aimed at readers working on average-reward RL and stochastic control for general chains. Anyone building estimators or analyzing finite-horizon performance would get direct value from the reconstruction properties. I would send it to peer review. The problem is real, the proposed fix is new, and the math looks worth checking in detail even if the projection choice needs tightening.

Referee Report

3 major / 3 minor

Summary. The paper studies fixed-policy evaluation for finite Markov chains that may be reducible or periodic. It identifies the real peripheral invariant subspace K(P) of the transition matrix P as the source of ambiguity in classical gain-bias decompositions. After selecting a gauge projection Π with kernel exactly K(P), the reward admits the decomposition r = g_Π^* + (I-P)v_Π^*, where g_Π^* is a persistent regime profile and v_Π^* is asserted to be a gauge-fixed transient component. The construction is claimed to be the minimal exact quotient removing all non-decaying modes, to reconstruct finite-horizon returns exactly, to recover statewise average reward, to admit a transient-cost interpretation, and to yield a stable estimator under a generative model.

Significance. If the central claims hold, the work supplies a cleaner separation of persistent and transient effects than classical normalized gain-bias methods, especially for periodic or reducible chains. The quotient construction and the explicit reconstruction property would be useful for both theoretical analysis and practical estimation in reinforcement learning settings where long-run averages alone are insufficient.

major comments (3)

[§3] §3 (gauge projection definition): The claim that v_Π^* is transient for any projection Π with ker(Π)=K(P) is not automatically true. Let E_per = K(P) and E_stab the complementary P-invariant stable subspace. Any complement S ≠ E_stab yields a splitting in which generic vectors in im(Π) retain a nonzero E_per component; the peripheral action on that component prevents ||P^n v_Π^*|| → 0. The transient-cost interpretation and finite-horizon reconstruction therefore require an additional, unstated restriction that im(Π)=E_stab. This is load-bearing for the central claims.
[§4.1] §4.1 (uniqueness and exactness): The uniqueness statement for the pair (g_Π^*, v_Π^*) is asserted after fixing Π, yet the proof sketch does not address whether different valid complements produce different transient components or whether the reconstruction identity holds uniformly. A concrete counter-example with a two-state periodic chain would clarify whether the quotient is truly minimal and exact for all admissible Π.
[§5] §5 (comparison with classical gain-bias): The reallocation argument that all persistent modes move into g_Π^* while v_Π^* becomes purely transient is stated without an explicit change-of-basis calculation between the classical bias and the new v_Π^*. The spectral-radius claim for the quotient operator needs a short lemma showing that the induced map on V/K(P) has spectral radius strictly less than 1.
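The first major comment can be illustrated with a small sketch. It assumes a hypothetical three-state chain (a 2-cycle plus one transient state), an illustrative reward, and hand-computed eigenvectors; `stable_coord`, `residual`, and the complement spanned by `s` are made-up names for this example only. Two lifts of the same quotient class are compared: one in the invariant stable subspace and one in a non-invariant complement.

```python
from fractions import Fraction as F

P = [[F(0), F(1), F(0)],
     [F(1), F(0), F(0)],
     [F(1, 2), F(0), F(1, 2)]]   # 2-cycle {0, 1} plus transient state 2
r = [F(3), F(1), F(5)]           # illustrative reward

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def apply_n(M, x, n):
    for _ in range(n):
        x = matvec(M, x)
    return x

def stable_coord(x):
    """Coefficient of e3 when x is expanded in the eigenbasis {ones, u, e3}."""
    return x[2] - (x[0] + x[1]) / 2 + (x[0] - x[1]) / 6

u = [F(1), F(-1), F(-1, 3)]      # eigenvalue -1, lies in K(P)
e3 = [F(0), F(0), F(1)]          # eigenvalue 1/2, spans E_stab
c = stable_coord(r)

v_canon = [2 * c * e for e in e3]          # lift in the invariant complement
s = [e + ui for e, ui in zip(e3, u)]       # spans a non-invariant complement S
v_skew = [2 * c * si for si in s]          # same quotient class, different lift

def residual(v):
    """g = r - (I - P) v for a candidate lift v."""
    Pv = matvec(P, v)
    return [ri - vi + pvi for ri, vi, pvi in zip(r, v, Pv)]

g_canon, g_skew = residual(v_canon), residual(v_skew)

def sup_norm(x):
    return max(abs(xi) for xi in x)

norm_canon = sup_norm(apply_n(P, v_canon, 20))   # shrinks like 2^-20
norm_skew = sup_norm(apply_n(P, v_skew, 20))     # stuck at 2*c
print(stable_coord(g_canon), stable_coord(g_skew), float(norm_canon), float(norm_skew))
```

Both residuals have zero stable coordinate, so both lie in K(P) and both pairs satisfy the decomposition identity. Only the lift taken from the invariant complement decays under iteration of P, which is the distinction the comment draws.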

minor comments (3)

[Preliminaries] The notation K(P) for the real peripheral invariant subspace is introduced without a reference to standard texts on nonnegative matrix theory (e.g., Seneta or Berman-Plemmons). A one-sentence citation would help readers.
[Figure 1] Figure 1 (spectral diagram) uses color coding that is not explained in the caption; the distinction between peripheral and stable modes should be stated explicitly.
[§5] The abstract claims 'exact comparison with classical normalized gain and bias' but the manuscript never writes the classical pair explicitly alongside (g_Π^*, v_Π^*). Adding a short side-by-side display would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and valuable suggestions. The points raised help clarify the gauge choice and strengthen the presentation. We respond to each major comment and will update the manuscript with the necessary additions and corrections.

Referee: [§3] §3 (gauge projection definition): The claim that v_Π^* is transient for any projection Π with ker(Π)=K(P) is not automatically true. Let E_per = K(P) and E_stab the complementary P-invariant stable subspace. Any complement S ≠ E_stab yields a splitting in which generic vectors in im(Π) retain a nonzero E_per component; the peripheral action on that component prevents ||P^n v_Π^*|| → 0. The transient-cost interpretation and finite-horizon reconstruction therefore require an additional, unstated restriction that im(Π)=E_stab. This is load-bearing for the central claims.

Authors: We agree with this observation. The transience property holds if and only if im(Π) coincides with the P-invariant stable subspace E_stab. We will revise the definition in §3 to specify that the gauge projection Π is the unique projection onto E_stab with kernel K(P). This makes the choice canonical and ensures that all vectors in im(Π) lie in E_stab, hence are transient under iteration of P. The quotient construction itself remains unchanged and minimal.
Revision: yes

Referee: [§4.1] §4.1 (uniqueness and exactness): The uniqueness statement for the pair (g_Π^*, v_Π^*) is asserted after fixing Π, yet the proof sketch does not address whether different valid complements produce different transient components or whether the reconstruction identity holds uniformly. A concrete counter-example with a two-state periodic chain would clarify whether the quotient is truly minimal and exact for all admissible Π.

Authors: With the revised definition of Π in §3, the complement is uniquely determined as the invariant stable subspace, so the pair (g_Π^*, v_Π^*) is unique. We will add a concrete example using a two-state deterministic periodic chain to demonstrate that non-invariant complements lead to non-transient v, while the invariant choice yields exact finite-horizon reconstruction and confirms minimality of the quotient.
Revision: yes

Referee: [§5] §5 (comparison with classical gain-bias): The reallocation argument that all persistent modes move into g_Π^* while v_Π^* becomes purely transient is stated without an explicit change-of-basis calculation between the classical bias and the new v_Π^*. The spectral-radius claim for the quotient operator needs a short lemma showing that the induced map on V/K(P) has spectral radius strictly less than 1.

Authors: We will include an explicit change-of-basis calculation in §5 comparing the classical bias to v_Π^*, showing the reallocation of persistent modes to g_Π^*. We will also add a short lemma establishing that the induced operator on the quotient V/K(P) has spectral radius strictly less than 1, following from the fact that its spectrum is that of P restricted to E_stab.
Revision: yes
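One way the requested lemma could be argued is sketched below. This is not the authors' proof. It assumes that the complexification of K(P) is the sum of the eigenspaces of P for eigenvalues of modulus one; the symbols V, \bar P, B, and Q are introduced here for illustration.

```latex
\textbf{Lemma (sketch).} Let $P$ be a finite stochastic matrix, $V=\mathbb{R}^n$,
and $\bar P : V/\mathcal{K}(P) \to V/\mathcal{K}(P)$ the map induced by $P$.
Then $\rho(\bar P) < 1$ (with $\rho = 0$ if the quotient is trivial).

\textbf{Argument.} Since $\|P^k\|_\infty = 1$ for all $k \ge 0$, the powers of $P$
are bounded, so every eigenvalue with $|\lambda| = 1$ is semisimple and
$\mathcal{K}(P)$ carries the full algebraic multiplicity of the peripheral
spectrum. In a basis adapted to $V = \mathcal{K}(P) \oplus S$, for any
complement $S$,
\[
  P \;\cong\; \begin{pmatrix} P|_{\mathcal{K}(P)} & B \\ 0 & Q \end{pmatrix},
  \qquad
  \chi_P(\lambda) = \chi_{P|_{\mathcal{K}(P)}}(\lambda)\,\chi_Q(\lambda),
\]
where $Q$ represents $\bar P$. Hence the eigenvalues of $Q$, counted with
multiplicity, are exactly the eigenvalues of $P$ with $|\lambda| < 1$, and
\[
  \rho(\bar P) \;=\; \max\{\,|\lambda| : \lambda \in \sigma(P),\ |\lambda| < 1\,\} \;<\; 1 .
\]
```

The block-triangular form does not require S to be invariant; choosing S = E_stab makes B = 0 and identifies Q with P restricted to E_stab, which matches the authors' stated justification.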

Circularity Check

0 steps flagged

Derivation is self-contained with no circular reductions

The paper defines the peripheral subspace K(P) from the spectral properties of the transition matrix P, selects a gauge projection Π with that kernel, and constructs the decomposition r = g_Π^* + (I-P)v_Π^* directly from these objects. The uniqueness and transience claims follow from the algebraic properties of the quotient and the choice of Π rather than from any fitted parameter, self-referential definition, or load-bearing self-citation. No step equates a derived quantity to its own input by construction, and the comparison to classical gain-bias methods reallocates existing information without circularity. The derivation is therefore independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on standard linear-algebraic properties of finite stochastic matrices and introduces one new projection operator to achieve uniqueness.

axioms (2)

domain assumption: Finite Markov chains possess a well-defined peripheral invariant subspace K(P) that contains all non-decaying modes of the transition matrix P.
Invoked to justify that quotienting by K(P) removes every non-decaying component.
ad hoc to paper: A gauge projection Π with kernel exactly K(P) exists and yields a unique decomposition of the reward.
Central to the uniqueness claim; introduced in the abstract without further justification.

invented entities (1)

Gauge projection Π (no independent evidence)
Purpose: to fix the gauge so that the decomposition r = g_Π^* + (I-P)v_Π^* is unique.
New operator defined by the paper to separate persistent and transient parts.
