Minimax PAC Bounds for Learning in Exogenous Contextual MDPs
Pith reviewed 2026-06-25 21:37 UTC · model grok-4.3
The pith
A variance-reduced algorithm achieves |Z|-independent minimax sample complexity for policy evaluation in exogenous contextual MDPs when transitions are known.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When rewards and transitions are known and only the context distribution μ is sampled, a variance-reduced algorithm solves policy evaluation, best-value estimation, and best-policy extraction with sample complexity (O~(1/((1-γ)^3 ε²)), 0). The rate is independent of |Z| and minimax optimal up to logarithmic factors. As a corollary, tight rates hold for one-step perfect look-ahead. In the fully unknown regime where both μ and P must be learned, policy evaluation remains |Z|-free, with matching upper and lower bounds (O~(|X|/((1-γ)^3 ε²)), O~(1/((1-γ)^2 ε²))).
What carries the argument
variance-reduced algorithm that uses pre-execution sampling oracles for the context distribution μ to eliminate dependence on |Z| while preserving minimax rates for policy evaluation tasks
If this is right
- Policy evaluation, best-value estimation, and best-policy extraction all achieve the same |Z|-free rate when transitions are known.
- The bounds remain minimax optimal up to logarithmic factors.
- One-step perfect look-ahead settings obtain improved tight rates as a direct corollary.
- When both μ and P are unknown, policy evaluation still has |Z|-free complexity with explicit matching upper and lower bounds.
- The on-policy sample requirement can be driven to zero in the known-transition regime.
Where Pith is reading between the lines
- The separation between context sampling and MDP dynamics may allow similar independence results when function approximation is used for large or continuous state spaces.
- The technique could be tested in simulated environments with very large |Z| to check whether practical performance matches the theoretical independence.
- If the i.i.d. context assumption is relaxed to weak dependence, the rates might extend with only extra logarithmic factors in the sample bounds.
- The results highlight that exogenous contexts add a controllable sampling overhead rather than an inherent enumeration cost.
Load-bearing premise
Contexts arrive i.i.d. from a fixed unknown distribution and the learner has access to sampling oracles for μ and possibly the transition kernel before and during execution.
What would settle it
If executing the variance-reduced algorithm on instances with increasing |Z| requires a number of context samples that grows with |Z| to reach target accuracy ε, that would falsify the claimed independence from context space size.
read the original abstract
We study PAC learning in tabular discounted Markov decision processes with exogenous i.i.d. contexts, with discount factor $\gamma$, finite state space $\mathcal X$, action space $\mathcal A$, and context space $\mathcal Z$. At each time step, a context is drawn independently from an unknown distribution $\mu$ and revealed before the agent acts. This context may affect both rewards and transitions, while remaining uncontrolled by the agent. Depending on the regime, the learner has access either to a sampling oracle for $\mu$, to a sampling oracle for the transition kernel conditioned on state-context-action tuples, or to both. Oracles can be accessed before and during policy execution. The sample complexity is measured by a couple $(n,m)$, where $n$ is the number of calls to the sampling oracles before execution and $m$ is the number of calls to the sampling oracles during execution. When rewards and transitions are known and only the context distribution $\mu$ is sampled, we give a variance-reduced algorithm that solves policy evaluation (PE), best-value estimation (BVE), and best-policy extraction (BPE) with $\left(\widetilde O\left(1/((1-\gamma)^3\varepsilon^2)\right), 0 \right) $ sample complexity. The rate is independent of $|\mathcal Z|$ and minimax optimal up to logarithmic factors. As a corollary, we also obtain tight rates in the case of one-step perfect look-ahead, improving upon the existing guarantees. In the fully unknown regime, where both $\mu$ and P must be learned, we show that PE remains $|\mathcal Z|$-free, with matching upper and lower bounds $\bigl(\widetilde O(|\mathcal X|/((1-\gamma)^3\varepsilon^2)),\, \widetilde O(1/((1-\gamma)^2\varepsilon^2))\bigr)$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies PAC learning in tabular discounted MDPs with exogenous i.i.d. contexts drawn from unknown μ. It considers oracles for μ and/or the transition kernel P. When P and rewards are known, a variance-reduced algorithm achieves (Õ(1/((1-γ)^3 ε²)), 0) samples for policy evaluation (PE), best-value estimation (BVE), and best-policy extraction (BPE), independent of |Z| and matching minimax lower bounds up to logs. A corollary gives tight rates for one-step perfect look-ahead. In the fully unknown regime, PE remains |Z|-free with matching bounds (Õ(|X|/((1-γ)^3 ε²)), Õ(1/((1-γ)^2 ε²))).
Significance. If the derivations hold, the results are significant for establishing |Z|-independent minimax-optimal rates in exogenous contextual MDPs, a setting where contexts affect rewards/transitions but are uncontrolled. The variance-reduced approach for unbiased estimation of the effective operator T yields the stated rates with m=0, and the matching lower bounds strengthen the contribution. The one-step look-ahead corollary improves existing guarantees.
minor comments (2)
- [Abstract / Introduction] The definition of the effective operator T and its variance bound O(1/(1-γ)^2) should be stated explicitly in the main text with a short derivation sketch, even if standard.
- [Abstract] Clarify the precise meaning of 'one-step perfect look-ahead' and how the corollary follows from the main result.
Simulated Author's Rebuttal
We thank the referee for their positive summary of our work and for recommending acceptance. We are pleased that the contributions on |Z|-free minimax PAC bounds for policy evaluation and best-policy extraction in exogenous contextual MDPs were viewed as significant.
Circularity Check
No significant circularity detected
full rationale
The paper derives sample-complexity bounds for PE/BVE/BPE via standard stochastic approximation and variance-reduction (SVRG-style) applied to an unbiased Monte-Carlo estimator of the effective operator T whose variance is bounded by O(1/(1-γ)^2) independently of |Z|. The construction uses only the exogenous-context assumption, known P/r, and classical convergence rates for discounted Bellman operators; no step reduces by definition to a fitted parameter, self-citation chain, or renamed empirical pattern. The minimax claim is supported by matching lower bounds obtained from the same model, not by internal re-use of the upper-bound quantities.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Finite state space X, action space A, context space Z; discount factor γ < 1
- domain assumption Contexts drawn i.i.d. from unknown μ and revealed before action
Reference graph
Works this paper leans on
-
[1]
Alekh Agarwal, Sham Kakade, and Lin F
URLhttps://arxiv.org/abs/2002.11349. Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal,
arXiv 2002
-
[2]
9 Simón Algorta and Özgür ¸ Sim¸ sek
URLhttps://arxiv.org/abs/1906.03804. 9 Simón Algorta and Özgür ¸ Sim¸ sek. The game of tetris in machine learning.arXiv preprint arXiv:1905.01652,
arXiv 1906
-
[3]
doi: 10.1609/aaai.v38i10.28953
ISSN 2159-5399. doi: 10.1609/aaai.v38i10.28953. URL http://dx.doi.org/10.1609/aaai.v38i10.28953. Mohammad Gheshlaghi Azar, Remi Munos, and Bert Kappen. On the sample complexity of reinforcement learning with a generative model,
-
[4]
URLhttps://arxiv.org/abs/2109.03173. Gah-Yi Ban and N. Bora Keskin. Personalized dynamic pricing with machine learning: High- dimensional features and heterogeneous elasticity.Management Science, 67(9):5549–5568,
-
[5]
URL https: //arxiv.org/abs/2503.13200. Max Bodoia and Nikhil Puranik. Applying reinforcement learning to competitive tetris,
-
[6]
doi: 10.1109/TAC.2006.875041. Eduardo F. Camacho and Carlos Bordons.Model Predictive Control. Advanced Textbooks in Control and Signal Processing. Springer London, 2 edition,
-
[7]
doi: 10.1007/978-0-85729-398-5
ISBN 978-0-85729-398-5. doi: 10.1007/978-0-85729-398-5. Xi Chen, David Simchi-Levi, and Yining Wang. Privacy-preserving dynamic personalized pricing with demand learning.Management Science,
-
[8]
arthur flajolet and Patrick Jaillet
URL https://arxiv.org/abs/2206.04282. arthur flajolet and Patrick Jaillet. Real-time bidding with side information. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed- itors,Advances in Neural Information Processing Systems, volume
-
[9]
Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer
URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 0bed45bd5774ffddc95ffe500024f628-Paper.pdf. Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. Approximate dynamic programming finally performs well in the game of tetris. InAdvances in Neural Information Processing Systems,
2017
-
[10]
Chenbei Lu, Zaiwei Chen, Tongxin Li, Chenye Wu, and Adam Wierman
URLhttps://arxiv.org/abs/2102.06548. Chenbei Lu, Zaiwei Chen, Tongxin Li, Chenye Wu, and Adam Wierman. Reinforcement learning with imperfect transition predictions: A bellman-jensen approach,
-
[11]
URL https://arxiv. org/abs/2510.18687. Aymen Al Marjani, Aurélien Garivier, and Alexandre Proutiere. Navigating to the best policy in markov decision processes,
-
[12]
URLhttps://arxiv.org/abs/2106.02847. Nadav Merlis. Reinforcement learning with lookahead information,
-
[13]
URL https://arxiv. org/abs/2406.02258. Ali Mesbah. Stochastic model predictive control: An overview and perspectives for future research. IEEE Control Systems Magazine, 36(6):30–44,
-
[14]
doi: 10.1109/MCS.2016.2602087. 10 Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles,
-
[15]
URLhttps://arxiv.org/abs/1910. 10597. Ashwin Pananjady and Martin J. Wainwright. Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning,
1910
-
[16]
Corentin Pla, Hugo Richard, Marc Abeille, Nadav Merlis, and Vianney Perchet
URLhttps://arxiv.org/abs/1909.08749. Corentin Pla, Hugo Richard, Marc Abeille, Nadav Merlis, and Vianney Perchet. On the hardness of reinforcement learning with transition look-ahead,
arXiv 1909
-
[17]
ISSN 0092-2102. doi: 10.1287/inte.2020.1047. URL https://doi.org/10. 1287/inte.2020.1047. Aaron Sidford, Mengdi Wang, Xian Wu, Lin F. Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving discounted markov decision process with a generative model,
-
[18]
URL https://arxiv.org/abs/1806.01492. Sean R. Sinclair, Felipe Frujeri, Ching-An Cheng, Luke Marshall, Hugo Barbalho, Jingling Li, Jennifer Neville, Ishai Menache, and Adith Swaminathan. Hindsight learning for mdps with exogenous inputs,
-
[19]
George Trimponias and Thomas G
URLhttps://arxiv.org/abs/2207.06272. George Trimponias and Thomas G. Dietterich. Reinforcement learning with exogenous states and rewards,
- [20]
-
[21]
URL https:// arxiv.org/abs/1906.04697. Jia Wan, Sean R. Sinclair, Devavrat Shah, and Martin J. Wainwright. Exploiting exogenous struc- ture for sample-efficient reinforcement learning,
arXiv 1906
-
[22]
URL https://arxiv.org/abs/2409. 14557. Hanzhao Wang, Kalyan Talluri, and Xiaocheng Li. Technical note—on dynamic pricing with covariates.Operations Research, 73(4):1932–1943,
1932
-
[23]
Under π and initial state X0 =x , the process ((Xt, Zt, At))t≥0 evolves as Zt i.i.d
17 Notation Fix a deterministic context-aware policy π:X × Z → A (the stochastic case is identical, with an additional integration over a). Under π and initial state X0 =x , the process ((Xt, Zt, At))t≥0 evolves as Zt i.i.d. ∼µ, A t =π(X t, Zt), X t+1 ∼P(· |X t, At, Zt),(24) withX 0 =x. We writeE x andVar x for expectation and variance under this law. Rec...
2019
-
[24]
Corollary 2(Bound on the complexity functional).For every context-aware policyπ, V(P, µ, π) := (I−γ ¯P π)−1√ vπ ∞ ≤ r1 +γ 1−γ ·β≤ √ 2β 3/2.(37) Proof
Sincex∈ Xis arbitrary, this concludes the proof. Corollary 2(Bound on the complexity functional).For every context-aware policyπ, V(P, µ, π) := (I−γ ¯P π)−1√ vπ ∞ ≤ r1 +γ 1−γ ·β≤ √ 2β 3/2.(37) Proof. The matrix ¯P π is row-stochastic, hence ∥ ¯P π∥ℓ∞→ℓ∞ ≤1 . Applying Lemma C.3 of Sidford et al. [2019], (I−γ ¯P π)−1√ vπ ∞ ≤ r1 +γ 1−γ (I−γ 2 ¯P π)−1vπ 1/2 ∞...
2019
-
[25]
The same algorithm serves both BVE and BPE. In both cases the relevant operator is the optimality operator (T¯v)(x) :=E Z∼µ h max a∈A n r(x, a, Z) +γ X x′ P(x ′ |x, a, Z) ¯v(x′) oi ,(40) and HALFERRestimates its fixed point ¯V ⋆. The BVE guarantee is read off directly from the output ˆ¯V ⋆; the BPE guarantee follows by greedy extraction at no additional s...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.