pith. machine review for the scientific record. sign in

arxiv: 2604.13414 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.AI

Recognition: unknown

Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords minimax optimalityMarkov dependencemajority-vote ensemblesspectral routingmixing timeFiedler eigenvectorensemble classificationadaptive partitioning
0
0 comments X

The pith

Adaptive spectral routing achieves the minimax optimal rate of order square root of mixing time over sample size for majority-vote ensembles on Markov chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when data points follow a Markov chain, majority-vote ensembles cannot reduce excess classification risk faster than the square root of the mixing time divided by the number of samples. Standard uniform bagging is provably worse, with a gap of order square root of the mixing time. Adaptive spectral routing fixes this on a suitable subclass by splitting the data using the main eigenvector of a dependency graph built from the samples, reaching the best possible rate without needing to know the mixing time. This provides a precise understanding of how dependence hurts variance reduction in ensembles and offers a practical fix for dependent data sources such as time series and replay buffers.

Core claim

For stationary, reversible, geometrically ergodic Markov chains in fixed dimension, no measurable estimator attains excess classification risk smaller than order square root of mixing time over sample size. On the AR(1) subclass, uniform bagging has risk at least order mixing time over square root of sample size. Adaptive spectral routing attains the optimal order square root of mixing time over sample size up to a lower-order geometric cut term on graph-regular chains by partitioning according to the empirical Fiedler eigenvector of the dependency graph, without knowledge of the mixing time.

What carries the argument

The empirical Fiedler eigenvector of the dependency graph, which partitions the training data into groups with reduced dependence before majority voting.

If this is right

  • Uniform bagging suffers excess risk at least order mixing time over square root of sample size on the AR(1) subclass, creating a square root of mixing time algorithmic gap.
  • The optimal rate is achieved without knowledge of the mixing time.
  • The lower bound holds for the full class of stationary reversible geometrically ergodic chains while the upper bound holds for graph-regular subclasses.
  • The rates are confirmed in experiments on synthetic Markov chains, 2D spatial grids, the UCR time series archive, and Atari DQN replay buffers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The spectral partitioning idea could apply to other ensemble methods such as boosting when data exhibits Markov dependence.
  • In reinforcement learning, routing replay buffer samples this way might further stabilize target networks beyond uniform sampling.
  • Combining the approach with dimension reduction would be a natural next step to handle settings beyond fixed ambient dimension.

Load-bearing premise

The underlying process is a stationary reversible geometrically ergodic Markov chain in fixed ambient dimension, with the matching upper bound restricted to a graph-regular subclass.

What would settle it

An estimator that achieves excess risk o of square root of mixing time over sample size on some geometrically ergodic chain for large sample size, or a graph-regular chain where spectral routing exceeds order square root of mixing time over sample size.

Figures

Figures reproduced from arXiv: 2604.13414 by Anuj Sharma, Ibne Farabi Shihab, Sanjeda Akter.

Figure 1
Figure 1. Figure 1: Measured pairwise base-learner covari￾ance versus Tmix on the AR(1) witness. Uni￾form bagging tracks the T 2 mix/n prediction of Lemma 5, while spectral routing compresses the covariance by roughly two orders of magnitude. 100 101 102 10−1.5 10−1 10−0.5 Tmix Excess risk Uniform Theory: Tmix/ √ n Spec. Rte. Theory: p Tmix/n [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Relationship between dataset autocorrelation and the accuracy gain of Spectral Routing [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Empirical verification of Assump￾tion 15 on Atari. reff plateaus to O(1). feature-similarity graph, as a heuristic manifestation of structural feature stratification rather than a resolution of Markovian resampling covariance [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Critical difference diagram over the full 128 UCR datasets. Methods connected by a [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
read the original abstract

Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than $\Omega(\sqrt{\Tmix/n})$. We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal with excess risk bounded below by $\Omega(\Tmix/\sqrt{n})$, exhibiting a $\sqrt{\Tmix}$ algorithmic gap. Finally, we propose \emph{adaptive spectral routing}, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate $\mathcal{O}(\sqrt{\Tmix/n})$ up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of $\Tmix$. Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nystr\"om approximation, and bounded non-stationarity are developed as supporting material in the appendix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript establishes a minimax lower bound of Ω(√(Tmix/n)) on excess classification risk for any measurable estimator applied to majority-vote ensembles, under the class of stationary, reversible, geometrically ergodic Markov chains in fixed ambient dimension. It shows that dependence-agnostic uniform bagging is suboptimal, attaining only Ω(Tmix/√n) on the AR(1) witness subclass used for the lower bound. The paper then introduces adaptive spectral routing, which constructs a dependency graph from the data, extracts its empirical Fiedler vector to partition the training set, and achieves the matching rate O(√(Tmix/n)) up to a lower-order geometric cut term on a graph-regular subclass, without requiring knowledge of Tmix. The results are illustrated with experiments on synthetic chains, 2D grids, the UCR archive, and Atari DQN replay buffers.

Significance. If the central claims hold, the work is significant for providing the first tight minimax characterization of how Markov dependence degrades classical ensemble variance reduction, together with an adaptive algorithm that attains the optimal rate on the relevant subclass. The construction of a matching upper bound via spectral partitioning without explicit mixing-time input, and the explicit demonstration of a √Tmix algorithmic gap for uniform bagging, are notable strengths. These results have direct implications for ensemble methods in time-series, spatial data, and reinforcement-learning replay buffers.

major comments (2)
  1. [Lower-bound section and graph-regular subclass definition] The lower-bound construction uses an AR(1) witness subclass of stationary reversible geometrically ergodic chains; the manuscript should explicitly verify that this subclass is contained in the graph-regular class on which the upper bound is proved, so that the Ω(√(Tmix/n)) and O(√(Tmix/n)) rates are directly comparable without additional restrictions.
  2. [Adaptive spectral routing algorithm and its analysis] The adaptive spectral routing algorithm relies on the empirical Fiedler vector of a data-derived dependency graph; the proof that this yields the minimax rate without knowledge of Tmix should clarify the approximation error between the empirical and population Fiedler vectors under only geometric ergodicity (rather than stronger mixing assumptions).
minor comments (2)
  1. [Upper-bound theorem statement] The geometric cut term appearing in the upper-bound rate should be defined explicitly in the main text (not only in the appendix) and its order relative to √(Tmix/n) should be stated clearly.
  2. [Notation and definitions] Notation for Tmix and the graph-regular subclass should be introduced at first use in the main body rather than deferred to the appendix.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the recommendation of minor revision. The comments are constructive and help strengthen the connection between the lower and upper bounds as well as the clarity of the spectral analysis. We address each major comment below.

read point-by-point responses
  1. Referee: [Lower-bound section and graph-regular subclass definition] The lower-bound construction uses an AR(1) witness subclass of stationary reversible geometrically ergodic chains; the manuscript should explicitly verify that this subclass is contained in the graph-regular class on which the upper bound is proved, so that the Ω(√(Tmix/n)) and O(√(Tmix/n)) rates are directly comparable without additional restrictions.

    Authors: We agree that an explicit containment argument is needed for direct rate comparison. In the revised manuscript we will insert a short lemma in the lower-bound section establishing that the AR(1) witness subclass satisfies the graph-regular conditions. The AR(1) process generates a dependency graph that is a path (hence bounded degree and regular spectral properties), and the geometric ergodicity assumption already ensures the required uniform spectral gap. This addition removes any ambiguity and confirms that the Ω(√(Tmix/n)) lower bound holds inside the graph-regular subclass. revision: yes

  2. Referee: [Adaptive spectral routing algorithm and its analysis] The adaptive spectral routing algorithm relies on the empirical Fiedler vector of a data-derived dependency graph; the proof that this yields the minimax rate without knowledge of Tmix should clarify the approximation error between the empirical and population Fiedler vectors under only geometric ergodicity (rather than stronger mixing assumptions).

    Authors: We thank the referee for highlighting the need for an explicit error bound. The existing proof already invokes geometric ergodicity to control the deviation of the empirical adjacency matrix via standard concentration inequalities for Markov chains, followed by a Davis-Kahan perturbation argument on the Laplacian. In the revision we will expand this step into a dedicated paragraph that states the precise bound on ||v̂_emp - v_pop||, shows that the resulting additive term is o(√(Tmix/n)), and reiterates that only geometric ergodicity (no stronger mixing) is used. This will make the argument self-contained and transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper derives an information-theoretic minimax lower bound of Ω(√(Tmix/n)) for any measurable estimator on the class of stationary, reversible, geometrically ergodic Markov chains in fixed dimension, using standard techniques that do not reduce to fitted parameters or self-referential definitions. The adaptive spectral routing algorithm is constructed independently via the empirical Fiedler vector of a data-derived graph and attains the matching rate (modulo lower-order terms) on the graph-regular subclass containing the AR(1) witness, without knowledge of Tmix. This is standard minimax structure with no load-bearing self-citation chains, no renaming of known results as new derivations, and no predictions that are forced by construction from inputs. The analysis is externally falsifiable and self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed from abstract only; full paper likely contains additional technical assumptions on the Markov chain and graph construction that cannot be audited here.

axioms (2)
  • domain assumption Training data follows a stationary, reversible, geometrically ergodic Markov chain in fixed ambient dimension
    Explicitly stated as the setting for the information-theoretic lower bound.
  • domain assumption The upper-bound result holds on a graph-regular subclass of such chains
    Required for the spectral routing algorithm to achieve the minimax rate.

pith-pipeline@v0.9.0 · 5594 in / 1393 out tokens · 27843 ms · 2026-05-10T14:28:56.841552+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 1 canonical work pages · 1 internal anchor

  1. [1]

    Online-to- PAC generalization bounds under graph-mixing dependencies

    Fran c ois Ab\'el\`es, Eugenio Clerico, and Massimiliano Pontil. Online-to- PAC generalization bounds under graph-mixing dependencies. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 2025

  2. [2]

    An optimistic perspective on offline reinforcement learning

    Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pages 104--114. PMLR, 2020

  3. [3]

    Bagging predictors

    Leo Breiman. Bagging predictors. Machine Learning, 24 0 (2): 0 123--140, 1996

  4. [4]

    Random forests

    Leo Breiman. Random forests. Machine Learning, 45 0 (1): 0 5--32, 2001

  5. [5]

    Angus Dempster, Fran c ois Petitjean, and Geoffrey I. Webb. ROCKET : Exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery, 34 0 (5): 0 1454--1495, 2020

  6. [6]

    A time series forest for classification and feature extraction

    Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. A time series forest for classification and feature extraction. Information Sciences, 239: 0 142--153, 2013

  7. [7]

    Algebraic connectivity of graphs

    Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23 0 (2): 0 298--305, 1973

  8. [8]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  9. [9]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Cl \'e ment Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31, 2018

  10. [10]

    Generalization bounds for non-stationary mixing processes

    Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106 0 (1): 0 93--117, 2017

  11. [11]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, 2017

  12. [12]

    Quantifying uncertainty in random forests via variance

    Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forests via variance. Journal of Machine Learning Research, 17 0 (1): 0 841--881, 2016

  13. [13]

    Rademacher complexity bounds for non-iid processes

    Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-iid processes. In Advances in Neural Information Processing Systems, volume 21, 2008

  14. [14]

    Least squares regression with M arkovian data: Fundamental limits and algorithms

    Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with M arkovian data: Fundamental limits and algorithms. In Advances in Neural Information Processing Systems, volume 33, pages 16666--16676, 2020

  15. [15]

    Deep exploration via bootstrapped DQN

    Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN . In Advances in Neural Information Processing Systems, volume 29, 2016

  16. [16]

    Randomized prior functions for deep reinforcement learning

    Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 31, 2018

  17. [17]

    Politis and Joseph P

    Dimitris N. Politis and Joseph P. Romano. A circular block-resampling procedure for stationary data. In Raoul LePage and Lynne Billard, editors, Exploring the Limits of Bootstrap, pages 263--270. Wiley, 1992

  18. [18]

    Politis and Joseph P

    Dimitris N. Politis and Joseph P. Romano. The stationary bootstrap. Journal of the American Statistical Association, 89 0 (428): 0 1303--1313, 1994

  19. [19]

    Prioritized experience replay

    Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016

  20. [20]

    Spielman and Nikhil Srivastava

    Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40 0 (6): 0 1913--1926, 2011

  21. [21]

    Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12 0 (4): 0 389--434, 2012

  22. [22]

    Tsybakov

    Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009

  23. [23]

    Gomez, ukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

  24. [24]

    Consistency of spectral clustering

    Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, 36 0 (2): 0 555--586, 2008

  25. [25]

    Christopher K. I. Williams and Matthias Seeger. Using the N ystr \"o m method to speed up kernel machines. In Advances in Neural Information Processing Systems, volume 13, pages 682--688, 2001

  26. [26]

    Rates of convergence for empirical processes of stationary mixing sequences

    Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22 0 (1): 0 94--116, 1994

  27. [27]

    Graph-dependent generalization bounds

    Rui Zhang, Liu Liu, and Xiaojin Zhu. Graph-dependent generalization bounds. In International Conference on Machine Learning. PMLR, 2019

  28. [28]

    Learning with little mixing

    Ingvar Ziemann and Steve Hanneke. Learning with little mixing. In Advances in Neural Information Processing Systems, volume 35, 2022