
arXiv: 2602.00474 · v2 · submitted 2026-01-31 · stat.ML · cs.LG · cs.NA · math.NA

Persistent-Transient Policy Evaluation for Markov Chains via Minimal Peripheral Quotients

Pith reviewed 2026-05-16 09:20 UTC · model grok-4.3

keywords: Markov chains, policy evaluation, peripheral invariant subspace, quotient methods, persistent transient decomposition, gain bias analysis, reducible chains, periodic chains

The pith

Quotienting Markov chains by their real peripheral invariant subspace yields a unique reward decomposition that cleanly separates persistent regime profiles from gauge-fixed transient components.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Classical gain-bias methods for policy evaluation in finite Markov chains mix persistent phase-dependent behavior into the bias term together with genuine transients, especially when chains are reducible or periodic. The paper identifies the real peripheral invariant subspace K(P) of the transition matrix as the source of this mixing. Quotienting by K(P) produces the minimal exact quotient that eliminates all non-decaying modes and renders the remaining dynamics strictly stable. With a gauge projection Π whose kernel is exactly K(P), the reward then decomposes uniquely as r = g_Π^* + (I-P)v_Π^*, where g_Π^* holds every persistent mode and v_Π^* holds only decaying effects. A reader would care because the split restores exact reconstruction of finite-horizon returns and statewise averages while supporting stable estimation from samples.

Core claim

After choosing a gauge projection Π with kernel K(P), the reward admits a unique decomposition r = g_Π^* + (I-P)v_Π^*, where g_Π^* is a persistent regime profile and v_Π^* is a gauge-fixed transient component. Quotienting by K(P) is the minimal exact quotient that removes all non-decaying modes. An exact comparison with classical normalized gain and bias shows that the new pair reallocates the same information so that all persistent modes appear in g_Π^* while v_Π^* remains transient. This decomposition reconstructs finite-horizon returns, recovers statewise average reward, admits a transient-cost interpretation, and yields a stable estimator under a generative model.
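
One way to see why such a decomposition can exist is the short derivation below. It is a sketch under assumptions that are not stated in the summary above: that the state space splits as V = K(P) ⊕ E_stab with both summands P-invariant, that P restricted to E_stab has spectral radius below 1, and that Π is the projection onto E_stab along K(P). The symbol E_stab and the series formula for v_Π^* are illustrative choices, not necessarily the paper's construction.

```latex
% Assumptions (illustrative): V = \mathcal{K}(P) \oplus E_{\mathrm{stab}}, both P-invariant,
% \rho(P|_{E_{\mathrm{stab}}}) < 1, and \Pi projects onto E_{\mathrm{stab}} along \mathcal{K}(P),
% so that \Pi P = P \Pi.
\begin{align*}
g_\Pi^\star &= (I - \Pi)\, r \in \mathcal{K}(P), \\
v_\Pi^\star &= \sum_{n \ge 0} P^n \Pi r \in E_{\mathrm{stab}}, \\
(I - P)\, v_\Pi^\star &= \Pi r
  \quad \text{(telescoping, since } P^n \Pi r \to 0\text{)}, \\
g_\Pi^\star + (I - P)\, v_\Pi^\star &= (I - \Pi)\, r + \Pi r = r, \\
\sum_{t=0}^{T-1} P^t r &= \sum_{t=0}^{T-1} P^t g_\Pi^\star + v_\Pi^\star - P^T v_\Pi^\star .
\end{align*}
```

Under these assumptions the last line is the finite-horizon reconstruction, with the tail term P^T v_Π^* vanishing as T grows.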

What carries the argument

The minimal peripheral quotient obtained by projecting out the real peripheral invariant subspace K(P) of the transition matrix P, via a gauge projection Π whose kernel is exactly K(P).
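
A minimal sketch of this construction on a toy chain may help. The 3-state chain, the reward vector, and the hand-computed eigenbasis below are all hypothetical illustrations rather than examples from the paper, and the gauge is assumed to be the spectral one (image of Π equal to the stable eigenspace).

```python
from fractions import Fraction as F

# Hypothetical chain: states 0 and 1 form a period-2 cycle; state 2 is
# transient and jumps to state 0 or 1 with probability 1/2 each.
P = [[F(0), F(1), F(0)],
     [F(1), F(0), F(0)],
     [F(1, 2), F(1, 2), F(0)]]
r = [F(4), F(0), F(5)]  # illustrative reward vector


def matvec(M, x):
    """Multiply matrix M (list of rows) by column vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


# For this chain the eigenvalues of P are 1, -1 and 0, with right
# eigenvectors (1,1,1), (1,-1,0) and (0,0,1). The first two span K(P);
# the third spans the stable invariant subspace.
# Coordinates of r in that eigenbasis, solved by hand:
a = (r[0] + r[1]) / 2  # coefficient on (1, 1, 1)
b = (r[0] - r[1]) / 2  # coefficient on (1, -1, 0)
c = r[2] - a           # coefficient on (0, 0, 1)

g = [a + b, a - b, a]  # persistent part, lies in K(P)
v = [F(0), F(0), c]    # transient part; P maps (0,0,1) to 0, so (I-P)v = v

Pv = matvec(P, v)
reconstructed = [gi + vi - pvi for gi, vi, pvi in zip(g, v, Pv)]
assert reconstructed == r                  # r = g + (I - P) v
assert matvec(P, matvec(P, g)) == g        # g is a period-2 persistent profile
assert Pv == [0, 0, 0]                     # v decays (here after one step)
```

In this instance g keeps the full phase-dependent profile (4, 0, 2) rather than only its cycle average, while v carries only the one-step effect of starting in the transient state.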

If this is right

  • Finite-horizon returns can be reconstructed exactly from the separated components.
  • Statewise average reward is recovered without absorbing persistent phases into bias terms.
  • The transient component admits a direct cost interpretation that decays under the quotient dynamics.
  • Estimation remains stable when data are generated by sampling the chain.
  • The classical gain-bias pair is recovered as a reallocation of the same information with persistent modes moved into the new g term.
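
The first bullet can be checked numerically on a small instance. The sketch below uses a hypothetical 3-state chain and a hand-computed pair (g, v), both illustrative assumptions, and verifies the telescoping identity sum_{t<T} P^t r = sum_{t<T} P^t g + v - P^T v, which follows algebraically from r = g + (I-P)v.

```python
from fractions import Fraction as F

# Hypothetical chain: period-2 cycle on states 0, 1; state 2 is transient.
P = [[F(0), F(1), F(0)],
     [F(1), F(0), F(0)],
     [F(1, 2), F(1, 2), F(0)]]
r = [F(4), F(0), F(5)]  # illustrative reward
g = [F(4), F(0), F(2)]  # persistent part, computed by hand for this chain
v = [F(0), F(0), F(3)]  # transient part, computed by hand for this chain


def matvec(M, x):
    """Multiply matrix M (list of rows) by column vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def power_apply(x, T):
    """Return P^T x."""
    cur = list(x)
    for _ in range(T):
        cur = matvec(P, cur)
    return cur


def partial_sum(x, T):
    """Return sum_{t=0}^{T-1} P^t x (undiscounted finite-horizon return)."""
    total = [F(0)] * len(x)
    cur = list(x)
    for _ in range(T):
        total = [s + ci for s, ci in zip(total, cur)]
        cur = matvec(P, cur)
    return total


def reconstructed_return(T):
    """Finite-horizon return rebuilt from the separated components."""
    G = partial_sum(g, T)
    tail = power_apply(v, T)
    return [Gi + vi - ti for Gi, vi, ti in zip(G, v, tail)]


for T in range(8):
    assert partial_sum(r, T) == reconstructed_return(T)
```

The identity holds for every horizon T in the loop, including T = 0, where both sides are zero.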

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation may allow reinforcement-learning algorithms to maintain separate models for long-term regime identification and short-term transient planning.
  • In environments with known periodicity, the persistent profile could be used to schedule phase-aware actions without averaging over cycles.
  • The same quotient construction might extend to approximate settings by estimating the peripheral subspace from sampled trajectories.

Load-bearing premise

There exists a gauge projection whose kernel is precisely the real peripheral invariant subspace K(P) and whose quotient removes every non-decaying mode while still allowing exact reconstruction of returns from the resulting components.

What would settle it

Construct a small periodic or reducible chain, compute the classical gain-bias pair and the proposed g_Π^* plus (I-P)v_Π^* pair, and check whether only the new pair produces exact equality between the sum of discounted rewards over a long finite horizon and the value obtained by solving the original Bellman equation.
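
A sketch of a variant of this check follows. It makes several assumptions: it uses a hypothetical 3-state chain, undiscounted evaluation equations rather than discounted sums, and hand-computed values for both pairs (classical gain as the Cesàro average, classical bias normalized to have zero Cesàro average). On this instance both pairs satisfy their evaluation equations exactly; the visible difference is in how the second component behaves under repeated application of P.

```python
from fractions import Fraction as F

# Hypothetical chain: period-2 cycle on states 0, 1; state 2 is transient.
P = [[F(0), F(1), F(0)],
     [F(1), F(0), F(0)],
     [F(1, 2), F(1, 2), F(0)]]
r = [F(4), F(0), F(5)]  # illustrative reward


def matvec(M, x):
    """Multiply matrix M (list of rows) by column vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def evaluate(persistent, other):
    """Return persistent + (I - P) other."""
    P_other = matvec(P, other)
    return [p + o - po for p, o, po in zip(persistent, other, P_other)]


def tail(x, T):
    """Return P^T x."""
    for _ in range(T):
        x = matvec(P, x)
    return x


# Classical pair, worked out by hand for this chain (assumed normalization).
gain = [F(2), F(2), F(2)]
bias = [F(1), F(-1), F(3)]

# Peripheral-quotient pair with the spectral gauge, also by hand.
g = [F(4), F(0), F(2)]
v = [F(0), F(0), F(3)]

assert evaluate(gain, bias) == r
assert evaluate(g, v) == r

bias_tails = [tail(bias, T) for T in (1, 2, 3, 4)]  # oscillates, never decays
v_tails = [tail(v, T) for T in (1, 2, 3, 4)]        # identically zero
```

Here the classical bias keeps a component along the period-2 mode (1, -1, 0), so P^T applied to it alternates in sign indefinitely, whereas P^T v is zero from the first step.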

Original abstract

We study fixed-policy evaluation for finite Markov chains that may be reducible and periodic. Classical evaluation methods with gain and bias decomposition are not always diagnostic: the gain records only invariant Cesàro averages, while persistent phase-dependent behavior is absorbed into the bias together with genuinely transient effects. We identify the real peripheral invariant subspace $\mathcal{K}(P)$ of the transition matrix $P$ as the source of this ambiguity. Quotienting by $\mathcal{K}(P)$ is the minimal exact quotient that removes all non-decaying modes and makes the remaining dynamics strictly stable. After choosing a gauge projection $\Pi$ with kernel $\mathcal{K}(P)$, the reward admits a unique decomposition $r = g_\Pi^\star + (I-P)v_\Pi^\star$, where $g_\Pi^\star$ is a persistent regime profile and $v_\Pi^\star$ is a gauge-fixed transient component. An exact comparison with classical normalized gain and bias shows that the new pair reallocates the same information so that all persistent modes are represented in $g_\Pi^\star$ and $v_\Pi^\star$ is transient. This decomposition reconstructs finite-horizon returns, recovers statewise average reward, admits a transient-cost interpretation, and yields a stable estimator under a generative model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper studies fixed-policy evaluation for finite Markov chains that may be reducible or periodic. It identifies the real peripheral invariant subspace K(P) of the transition matrix P as the source of ambiguity in classical gain-bias decompositions. After selecting a gauge projection Π with kernel exactly K(P), the reward admits the decomposition r = g_Π^* + (I-P)v_Π^*, where g_Π^* is a persistent regime profile and v_Π^* is asserted to be a gauge-fixed transient component. The construction is claimed to be the minimal exact quotient removing all non-decaying modes, to reconstruct finite-horizon returns exactly, to recover statewise average reward, to admit a transient-cost interpretation, and to yield a stable estimator under a generative model.

Significance. If the central claims hold, the work supplies a cleaner separation of persistent and transient effects than classical normalized gain-bias methods, especially for periodic or reducible chains. The quotient construction and the explicit reconstruction property would be useful for both theoretical analysis and practical estimation in reinforcement learning settings where long-run averages alone are insufficient.

major comments (3)
  1. [§3] §3 (gauge projection definition): The claim that v_Π^* is transient for any projection Π with ker(Π)=K(P) is not automatically true. Let E_per = K(P) and E_stab the complementary P-invariant stable subspace. Any complement S ≠ E_stab yields a splitting in which generic vectors in im(Π) retain a nonzero E_per component; the peripheral action on that component prevents ||P^n v_Π^*|| → 0. The transient-cost interpretation and finite-horizon reconstruction therefore require an additional, unstated restriction that im(Π)=E_stab. This is load-bearing for the central claims.
  2. [§4.1] §4.1 (uniqueness and exactness): The uniqueness statement for the pair (g_Π^*, v_Π^*) is asserted after fixing Π, yet the proof sketch does not address whether different valid complements produce different transient components or whether the reconstruction identity holds uniformly. A concrete counter-example with a two-state periodic chain would clarify whether the quotient is truly minimal and exact for all admissible Π.
  3. [§5] §5 (comparison with classical gain-bias): The reallocation argument that all persistent modes move into g_Π^* while v_Π^* becomes purely transient is stated without an explicit change-of-basis calculation between the classical bias and the new v_Π^*. The spectral-radius claim for the quotient operator needs a short lemma showing that the induced map on V/K(P) has spectral radius strictly less than 1.
minor comments (3)
  1. [Preliminaries] The notation K(P) for the real peripheral invariant subspace is introduced without a reference to standard texts on nonnegative matrix theory (e.g., Seneta or Berman-Plemmons). A one-sentence citation would help readers.
  2. [Figure 1] Figure 1 (spectral diagram) uses color coding that is not explained in the caption; the distinction between peripheral and stable modes should be stated explicitly.
  3. [§5] The abstract claims 'exact comparison with classical normalized gain and bias' but the manuscript never writes the classical pair explicitly alongside (g_Π^*, v_Π^*). Adding a short side-by-side display would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and valuable suggestions. The points raised help clarify the gauge choice and strengthen the presentation. We respond to each major comment and will update the manuscript with the necessary additions and corrections.

Point-by-point responses
  1. Referee: [§3] §3 (gauge projection definition): The claim that v_Π^* is transient for any projection Π with ker(Π)=K(P) is not automatically true. Let E_per = K(P) and E_stab the complementary P-invariant stable subspace. Any complement S ≠ E_stab yields a splitting in which generic vectors in im(Π) retain a nonzero E_per component; the peripheral action on that component prevents ||P^n v_Π^*|| → 0. The transient-cost interpretation and finite-horizon reconstruction therefore require an additional, unstated restriction that im(Π)=E_stab. This is load-bearing for the central claims.

    Authors: We agree with this observation. The transience property holds if and only if im(Π) coincides with the P-invariant stable subspace E_stab. We will revise the definition in §3 to specify that the gauge projection Π is the unique projection onto E_stab with kernel K(P). This makes the choice canonical and ensures that all vectors in im(Π) lie in E_stab, hence are transient under iteration of P. The quotient construction itself remains unchanged and minimal.
    Revision: yes

  2. Referee: [§4.1] §4.1 (uniqueness and exactness): The uniqueness statement for the pair (g_Π^*, v_Π^*) is asserted after fixing Π, yet the proof sketch does not address whether different valid complements produce different transient components or whether the reconstruction identity holds uniformly. A concrete counter-example with a two-state periodic chain would clarify whether the quotient is truly minimal and exact for all admissible Π.

    Authors: With the revised definition of Π in §3, the complement is uniquely determined as the invariant stable subspace, so the pair (g_Π^*, v_Π^*) is unique. We will add a concrete example using a two-state deterministic periodic chain to demonstrate that non-invariant complements lead to non-transient v, while the invariant choice yields exact finite-horizon reconstruction and confirms minimality of the quotient.
    Revision: yes

  3. Referee: [§5] §5 (comparison with classical gain-bias): The reallocation argument that all persistent modes move into g_Π^* while v_Π^* becomes purely transient is stated without an explicit change-of-basis calculation between the classical bias and the new v_Π^*. The spectral-radius claim for the quotient operator needs a short lemma showing that the induced map on V/K(P) has spectral radius strictly less than 1.

    Authors: We will include an explicit change-of-basis calculation in §5 comparing the classical bias to v_Π^*, showing the reallocation of persistent modes to g_Π^*. We will also add a short lemma establishing that the induced operator on the quotient V/K(P) has spectral radius strictly less than 1, following from the fact that its spectrum is that of P restricted to E_stab.
    Revision: yes
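
The gauge issue in the first exchange can be illustrated on a toy instance. The sketch below is not the two-state example the authors promise; it uses a hypothetical 3-state chain, and the alternative complement span{(1, 0, 1)} and all hand-computed vectors are illustrative assumptions. Both gauges give a valid decomposition with the persistent part in K(P), but only the invariant complement gives a transient component that decays.

```python
from fractions import Fraction as F

# Hypothetical chain: period-2 cycle on states 0, 1; state 2 is transient.
P = [[F(0), F(1), F(0)],
     [F(1), F(0), F(0)],
     [F(1, 2), F(1, 2), F(0)]]
r = [F(4), F(0), F(5)]  # illustrative reward


def matvec(M, x):
    """Multiply matrix M (list of rows) by column vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def in_K(x):
    """Membership in K(P) = span{(1,1,1), (1,-1,0)} for this chain only."""
    return x[2] == (x[0] + x[1]) / 2


def evaluate(g, v):
    """Return g + (I - P) v."""
    Pv = matvec(P, v)
    return [gi + vi - pvi for gi, vi, pvi in zip(g, v, Pv)]


def tail(x, T):
    """Return P^T x."""
    for _ in range(T):
        x = matvec(P, x)
    return x


# Gauge 1: im(Pi) = span{(0,0,1)}, the P-invariant stable subspace.
g_spec, v_spec = [F(4), F(0), F(2)], [F(0), F(0), F(3)]

# Gauge 2: im(Pi) = span{(1,0,1)}, a complement of K(P) that is not
# P-invariant. Values solved by hand so that v_alt - v_spec lies in K(P).
g_alt, v_alt = [F(-2), F(6), F(2)], [F(6), F(0), F(6)]

for g, v in ((g_spec, v_spec), (g_alt, v_alt)):
    assert evaluate(g, v) == r  # both decompositions reproduce r
    assert in_K(g)              # both persistent parts lie in K(P)

# Same class in the quotient V / K(P): the representatives differ by K(P).
assert in_K([x - y for x, y in zip(v_alt, v_spec)])

spec_tails = [tail(v_spec, T) for T in (1, 2, 3)]  # all zero
alt_tails = [tail(v_alt, T) for T in (1, 2, 3)]    # keeps oscillating
```

In this instance the two transient candidates represent the same element of the quotient, which is consistent with the quotient construction being unchanged by the gauge, while the decay property depends on which representative is picked.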

Circularity Check

0 steps flagged

Derivation is self-contained with no circular reductions

The paper defines the peripheral subspace K(P) from the spectral properties of the transition matrix P, selects a gauge projection Π with that kernel, and constructs the decomposition r = g_Π^* + (I-P)v_Π^* directly from these objects. The uniqueness and transience claims follow from the algebraic properties of the quotient and the choice of Π rather than from any fitted parameter, self-referential definition, or load-bearing self-citation. No step equates a derived quantity to its own input by construction, and the comparison to classical gain-bias methods reallocates existing information without circularity. The derivation is therefore independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on standard linear-algebraic properties of finite stochastic matrices and introduces one new projection operator to achieve uniqueness.

axioms (2)
  • domain assumption: Finite Markov chains possess a well-defined peripheral invariant subspace K(P) that contains all non-decaying modes of the transition matrix P.
    Invoked to justify that quotienting by K(P) removes every non-decaying component.
  • ad hoc to paper: A gauge projection Π with kernel exactly K(P) exists and yields a unique decomposition of the reward.
    Central to the uniqueness claim; introduced in the abstract without further justification.
invented entities (1)
  • Gauge projection Π (no independent evidence)
    purpose: To fix the gauge so that the decomposition r = g_Π^* + (I-P)v_Π^* is unique.
    New operator defined by the paper to separate persistent and transient parts.

pith-pipeline@v0.9.0 · 5536 in / 1444 out tokens · 65235 ms · 2026-05-16T09:20:59.073822+00:00 · methodology
