Minimax PAC Bounds for Learning in Exogenous Contextual MDPs

Corentin Pla; Hugo Richard; Marc Abeille; Vianney Perchet

arxiv: 2606.25170 · v1 · pith:KF5RMDIFnew · submitted 2026-06-23 · 📊 stat.ML · cs.LG

Minimax PAC Bounds for Learning in Exogenous Contextual MDPs

Corentin Pla , Hugo Richard , Marc Abeille , Vianney Perchet This is my paper

Pith reviewed 2026-06-25 21:37 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords exogenous contextual MDPsPAC boundspolicy evaluationvariance reductionsample complexityminimax optimalitydiscounted MDPscontext distribution sampling

0 comments

The pith

A variance-reduced algorithm achieves |Z|-independent minimax sample complexity for policy evaluation in exogenous contextual MDPs when transitions are known.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies PAC learning in discounted tabular MDPs where contexts arrive i.i.d. from an unknown distribution and shape rewards and transitions but remain outside the agent's control. It distinguishes regimes based on access to sampling oracles for the context distribution and the transition kernel. When rewards and transitions are known, the authors present a variance-reduced algorithm that solves policy evaluation, best-value estimation, and best-policy extraction using only pre-execution samples of the context distribution. The resulting complexity is independent of the context space size and matches known lower bounds up to logs. This matters for problems with many external contexts, such as user features or environmental conditions, because it shows the learning cost need not grow with the number of possible contexts.

Core claim

When rewards and transitions are known and only the context distribution μ is sampled, a variance-reduced algorithm solves policy evaluation, best-value estimation, and best-policy extraction with sample complexity (O~(1/((1-γ)^3 ε²)), 0). The rate is independent of |Z| and minimax optimal up to logarithmic factors. As a corollary, tight rates hold for one-step perfect look-ahead. In the fully unknown regime where both μ and P must be learned, policy evaluation remains |Z|-free, with matching upper and lower bounds (O~(|X|/((1-γ)^3 ε²)), O~(1/((1-γ)^2 ε²))).

What carries the argument

variance-reduced algorithm that uses pre-execution sampling oracles for the context distribution μ to eliminate dependence on |Z| while preserving minimax rates for policy evaluation tasks

If this is right

Policy evaluation, best-value estimation, and best-policy extraction all achieve the same |Z|-free rate when transitions are known.
The bounds remain minimax optimal up to logarithmic factors.
One-step perfect look-ahead settings obtain improved tight rates as a direct corollary.
When both μ and P are unknown, policy evaluation still has |Z|-free complexity with explicit matching upper and lower bounds.
The on-policy sample requirement can be driven to zero in the known-transition regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation between context sampling and MDP dynamics may allow similar independence results when function approximation is used for large or continuous state spaces.
The technique could be tested in simulated environments with very large |Z| to check whether practical performance matches the theoretical independence.
If the i.i.d. context assumption is relaxed to weak dependence, the rates might extend with only extra logarithmic factors in the sample bounds.
The results highlight that exogenous contexts add a controllable sampling overhead rather than an inherent enumeration cost.

Load-bearing premise

Contexts arrive i.i.d. from a fixed unknown distribution and the learner has access to sampling oracles for μ and possibly the transition kernel before and during execution.

What would settle it

If executing the variance-reduced algorithm on instances with increasing |Z| requires a number of context samples that grows with |Z| to reach target accuracy ε, that would falsify the claimed independence from context space size.

read the original abstract

We study PAC learning in tabular discounted Markov decision processes with exogenous i.i.d. contexts, with discount factor $\gamma$, finite state space $\mathcal X$, action space $\mathcal A$, and context space $\mathcal Z$. At each time step, a context is drawn independently from an unknown distribution $\mu$ and revealed before the agent acts. This context may affect both rewards and transitions, while remaining uncontrolled by the agent. Depending on the regime, the learner has access either to a sampling oracle for $\mu$, to a sampling oracle for the transition kernel conditioned on state-context-action tuples, or to both. Oracles can be accessed before and during policy execution. The sample complexity is measured by a couple $(n,m)$, where $n$ is the number of calls to the sampling oracles before execution and $m$ is the number of calls to the sampling oracles during execution. When rewards and transitions are known and only the context distribution $\mu$ is sampled, we give a variance-reduced algorithm that solves policy evaluation (PE), best-value estimation (BVE), and best-policy extraction (BPE) with $\left(\widetilde O\left(1/((1-\gamma)^3\varepsilon^2)\right), 0 \right) $ sample complexity. The rate is independent of $|\mathcal Z|$ and minimax optimal up to logarithmic factors. As a corollary, we also obtain tight rates in the case of one-step perfect look-ahead, improving upon the existing guarantees. In the fully unknown regime, where both $\mu$ and P must be learned, we show that PE remains $|\mathcal Z|$-free, with matching upper and lower bounds $\bigl(\widetilde O(|\mathcal X|/((1-\gamma)^3\varepsilon^2)),\, \widetilde O(1/((1-\gamma)^2\varepsilon^2))\bigr)$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a variance-reduced algorithm for |Z|-free minimax rates on policy evaluation and extraction in exogenous contextual MDPs when P and r are known.

read the letter

The key takeaway is that when rewards and transitions are known, their variance-reduced method solves policy evaluation, best-value estimation, and best-policy extraction with sample complexity Õ(1/((1-γ)^3 ε²)) and zero online samples. The rate is independent of |Z| and matches the lower bound up to logs.

This is new in the exogenous i.i.d. context setting with the stated oracles; the effective operator T v(x) = E_{z~μ}[max_a r(x,a,z) + γ E_{x'} P(x'|x,a,z) v(x')] can be estimated with variance bounded independently of |Z|, which lets standard variance reduction deliver the bound. They also show policy evaluation stays |Z|-free in the fully unknown regime with matching upper and lower bounds, and they tighten the one-step look-ahead case.

The work is confined to tabular MDPs, which is a real limit on scope but standard for these exact minimax claims. The oracle access model is spelled out clearly and the variance argument checks out without circularity or post-hoc fitting. No load-bearing gaps appear in the central construction.

This is for RL theorists working on sample complexity in contextual or side-information settings. A reader who cares about PAC bounds for MDPs with exogenous noise will find the rates and proof ideas worth their time.

Send it to peer review; the claims are sharp enough to justify referee effort.

Referee Report

0 major / 2 minor

Summary. The paper studies PAC learning in tabular discounted MDPs with exogenous i.i.d. contexts drawn from unknown μ. It considers oracles for μ and/or the transition kernel P. When P and rewards are known, a variance-reduced algorithm achieves (Õ(1/((1-γ)^3 ε²)), 0) samples for policy evaluation (PE), best-value estimation (BVE), and best-policy extraction (BPE), independent of |Z| and matching minimax lower bounds up to logs. A corollary gives tight rates for one-step perfect look-ahead. In the fully unknown regime, PE remains |Z|-free with matching bounds (Õ(|X|/((1-γ)^3 ε²)), Õ(1/((1-γ)^2 ε²))).

Significance. If the derivations hold, the results are significant for establishing |Z|-independent minimax-optimal rates in exogenous contextual MDPs, a setting where contexts affect rewards/transitions but are uncontrolled. The variance-reduced approach for unbiased estimation of the effective operator T yields the stated rates with m=0, and the matching lower bounds strengthen the contribution. The one-step look-ahead corollary improves existing guarantees.

minor comments (2)

[Abstract / Introduction] The definition of the effective operator T and its variance bound O(1/(1-γ)^2) should be stated explicitly in the main text with a short derivation sketch, even if standard.
[Abstract] Clarify the precise meaning of 'one-step perfect look-ahead' and how the corollary follows from the main result.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work and for recommending acceptance. We are pleased that the contributions on |Z|-free minimax PAC bounds for policy evaluation and best-policy extraction in exogenous contextual MDPs were viewed as significant.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives sample-complexity bounds for PE/BVE/BPE via standard stochastic approximation and variance-reduction (SVRG-style) applied to an unbiased Monte-Carlo estimator of the effective operator T whose variance is bounded by O(1/(1-γ)^2) independently of |Z|. The construction uses only the exogenous-context assumption, known P/r, and classical convergence rates for discounted Bellman operators; no step reduces by definition to a fitted parameter, self-citation chain, or renamed empirical pattern. The minimax claim is supported by matching lower bounds obtained from the same model, not by internal re-use of the upper-bound quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The results rest on standard finite tabular MDP assumptions and the exogenous i.i.d. context model; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Finite state space X, action space A, context space Z; discount factor γ < 1
Tabular discounted MDP setting stated in the abstract.
domain assumption Contexts drawn i.i.d. from unknown μ and revealed before action
Core modeling choice for exogenous contexts.

pith-pipeline@v0.9.1-grok · 5878 in / 1309 out tokens · 25845 ms · 2026-06-25T21:37:43.760108+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 5 canonical work pages

[1]

Alekh Agarwal, Sham Kakade, and Lin F

URLhttps://arxiv.org/abs/2002.11349. Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal,

arXiv 2002
[2]

9 Simón Algorta and Özgür ¸ Sim¸ sek

URLhttps://arxiv.org/abs/1906.03804. 9 Simón Algorta and Özgür ¸ Sim¸ sek. The game of tetris in machine learning.arXiv preprint arXiv:1905.01652,

arXiv 1906
[3]

doi: 10.1609/aaai.v38i10.28953

ISSN 2159-5399. doi: 10.1609/aaai.v38i10.28953. URL http://dx.doi.org/10.1609/aaai.v38i10.28953. Mohammad Gheshlaghi Azar, Remi Munos, and Bert Kappen. On the sample complexity of reinforcement learning with a generative model,

work page doi:10.1609/aaai.v38i10.28953
[4]

Gah-Yi Ban and N

URLhttps://arxiv.org/abs/2109.03173. Gah-Yi Ban and N. Bora Keskin. Personalized dynamic pricing with machine learning: High- dimensional features and heterogeneous elasticity.Management Science, 67(9):5549–5568,

arXiv
[5]

Max Bodoia and Nikhil Puranik

URL https: //arxiv.org/abs/2503.13200. Max Bodoia and Nikhil Puranik. Applying reinforcement learning to competitive tetris,

arXiv
[6]

Eduardo F

doi: 10.1109/TAC.2006.875041. Eduardo F. Camacho and Carlos Bordons.Model Predictive Control. Advanced Textbooks in Control and Signal Processing. Springer London, 2 edition,

work page doi:10.1109/tac.2006.875041 2006
[7]

doi: 10.1007/978-0-85729-398-5

ISBN 978-0-85729-398-5. doi: 10.1007/978-0-85729-398-5. Xi Chen, David Simchi-Levi, and Yining Wang. Privacy-preserving dynamic personalized pricing with demand learning.Management Science,

work page doi:10.1007/978-0-85729-398-5
[8]

arthur flajolet and Patrick Jaillet

URL https://arxiv.org/abs/2206.04282. arthur flajolet and Patrick Jaillet. Real-time bidding with side information. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed- itors,Advances in Neural Information Processing Systems, volume

arXiv
[9]

Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer

URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 0bed45bd5774ffddc95ffe500024f628-Paper.pdf. Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. Approximate dynamic programming finally performs well in the game of tetris. InAdvances in Neural Information Processing Systems,

2017
[10]

Chenbei Lu, Zaiwei Chen, Tongxin Li, Chenye Wu, and Adam Wierman

URLhttps://arxiv.org/abs/2102.06548. Chenbei Lu, Zaiwei Chen, Tongxin Li, Chenye Wu, and Adam Wierman. Reinforcement learning with imperfect transition predictions: A bellman-jensen approach,

arXiv
[11]

org/abs/2510.18687

URL https://arxiv. org/abs/2510.18687. Aymen Al Marjani, Aurélien Garivier, and Alexandre Proutiere. Navigating to the best policy in markov decision processes,

arXiv
[12]

Nadav Merlis

URLhttps://arxiv.org/abs/2106.02847. Nadav Merlis. Reinforcement learning with lookahead information,

arXiv
[13]

org/abs/2406.02258

URL https://arxiv. org/abs/2406.02258. Ali Mesbah. Stochastic model predictive control: An overview and perspectives for future research. IEEE Control Systems Magazine, 36(6):30–44,

arXiv
[14]

Stochastic

doi: 10.1109/MCS.2016.2602087. 10 Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles,

work page doi:10.1109/mcs.2016.2602087 2016
[15]

URLhttps://arxiv.org/abs/1910. 10597. Ashwin Pananjady and Martin J. Wainwright. Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning,

1910
[16]

Corentin Pla, Hugo Richard, Marc Abeille, Nadav Merlis, and Vianney Perchet

URLhttps://arxiv.org/abs/1909.08749. Corentin Pla, Hugo Richard, Marc Abeille, Nadav Merlis, and Vianney Perchet. On the hardness of reinforcement learning with transition look-ahead,

arXiv 1909
[17]

doi: 10.1287/inte.2020.1047

ISSN 0092-2102. doi: 10.1287/inte.2020.1047. URL https://doi.org/10. 1287/inte.2020.1047. Aaron Sidford, Mengdi Wang, Xian Wu, Lin F. Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving discounted markov decision process with a generative model,

work page doi:10.1287/inte.2020.1047 2020
[18]

URL https://arxiv.org/abs/1806.01492. Sean R. Sinclair, Felipe Frujeri, Ching-An Cheng, Luke Marshall, Hugo Barbalho, Jingling Li, Jennifer Neville, Ishai Menache, and Adith Swaminathan. Hindsight learning for mdps with exogenous inputs,

Pith/arXiv arXiv
[19]

George Trimponias and Thomas G

URLhttps://arxiv.org/abs/2207.06272. George Trimponias and Thomas G. Dietterich. Reinforcement learning with exogenous states and rewards,

arXiv
[20]

Martin J

URLhttps://arxiv.org/abs/2303.12957. Martin J. Wainwright. Variance-reduced q-learning is minimax optimal,

arXiv
[21]

Jia Wan, Sean R

URL https:// arxiv.org/abs/1906.04697. Jia Wan, Sean R. Sinclair, Devavrat Shah, and Martin J. Wainwright. Exploiting exogenous struc- ture for sample-efficient reinforcement learning,

arXiv 1906
[22]

URL https://arxiv.org/abs/2409. 14557. Hanzhao Wang, Kalyan Talluri, and Xiaocheng Li. Technical note—on dynamic pricing with covariates.Operations Research, 73(4):1932–1943,

1932
[23]

Under π and initial state X0 =x , the process ((Xt, Zt, At))t≥0 evolves as Zt i.i.d

17 Notation Fix a deterministic context-aware policy π:X × Z → A (the stochastic case is identical, with an additional integration over a). Under π and initial state X0 =x , the process ((Xt, Zt, At))t≥0 evolves as Zt i.i.d. ∼µ, A t =π(X t, Zt), X t+1 ∼P(· |X t, At, Zt),(24) withX 0 =x. We writeE x andVar x for expectation and variance under this law. Rec...

2019
[24]

Corollary 2(Bound on the complexity functional).For every context-aware policyπ, V(P, µ, π) := (I−γ ¯P π)−1√ vπ ∞ ≤ r1 +γ 1−γ ·β≤ √ 2β 3/2.(37) Proof

Sincex∈ Xis arbitrary, this concludes the proof. Corollary 2(Bound on the complexity functional).For every context-aware policyπ, V(P, µ, π) := (I−γ ¯P π)−1√ vπ ∞ ≤ r1 +γ 1−γ ·β≤ √ 2β 3/2.(37) Proof. The matrix ¯P π is row-stochastic, hence ∥ ¯P π∥ℓ∞→ℓ∞ ≤1 . Applying Lemma C.3 of Sidford et al. [2019], (I−γ ¯P π)−1√ vπ ∞ ≤ r1 +γ 1−γ (I−γ 2 ¯P π)−1vπ 1/2 ∞...

2019
[25]

The same algorithm serves both BVE and BPE. In both cases the relevant operator is the optimality operator (T¯v)(x) :=E Z∼µ h max a∈A n r(x, a, Z) +γ X x′ P(x ′ |x, a, Z) ¯v(x′) oi ,(40) and HALFERRestimates its fixed point ¯V ⋆. The BVE guarantee is read off directly from the output ˆ¯V ⋆; the BPE guarantee follows by greedy extraction at no additional s...

2019

[1] [1]

Alekh Agarwal, Sham Kakade, and Lin F

URLhttps://arxiv.org/abs/2002.11349. Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal,

arXiv 2002

[2] [2]

9 Simón Algorta and Özgür ¸ Sim¸ sek

URLhttps://arxiv.org/abs/1906.03804. 9 Simón Algorta and Özgür ¸ Sim¸ sek. The game of tetris in machine learning.arXiv preprint arXiv:1905.01652,

arXiv 1906

[3] [3]

doi: 10.1609/aaai.v38i10.28953

ISSN 2159-5399. doi: 10.1609/aaai.v38i10.28953. URL http://dx.doi.org/10.1609/aaai.v38i10.28953. Mohammad Gheshlaghi Azar, Remi Munos, and Bert Kappen. On the sample complexity of reinforcement learning with a generative model,

work page doi:10.1609/aaai.v38i10.28953

[4] [4]

Gah-Yi Ban and N

URLhttps://arxiv.org/abs/2109.03173. Gah-Yi Ban and N. Bora Keskin. Personalized dynamic pricing with machine learning: High- dimensional features and heterogeneous elasticity.Management Science, 67(9):5549–5568,

arXiv

[5] [5]

Max Bodoia and Nikhil Puranik

URL https: //arxiv.org/abs/2503.13200. Max Bodoia and Nikhil Puranik. Applying reinforcement learning to competitive tetris,

arXiv

[6] [6]

Eduardo F

doi: 10.1109/TAC.2006.875041. Eduardo F. Camacho and Carlos Bordons.Model Predictive Control. Advanced Textbooks in Control and Signal Processing. Springer London, 2 edition,

work page doi:10.1109/tac.2006.875041 2006

[7] [7]

doi: 10.1007/978-0-85729-398-5

ISBN 978-0-85729-398-5. doi: 10.1007/978-0-85729-398-5. Xi Chen, David Simchi-Levi, and Yining Wang. Privacy-preserving dynamic personalized pricing with demand learning.Management Science,

work page doi:10.1007/978-0-85729-398-5

[8] [8]

arthur flajolet and Patrick Jaillet

URL https://arxiv.org/abs/2206.04282. arthur flajolet and Patrick Jaillet. Real-time bidding with side information. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed- itors,Advances in Neural Information Processing Systems, volume

arXiv

[9] [9]

Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer

URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 0bed45bd5774ffddc95ffe500024f628-Paper.pdf. Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. Approximate dynamic programming finally performs well in the game of tetris. InAdvances in Neural Information Processing Systems,

2017

[10] [10]

Chenbei Lu, Zaiwei Chen, Tongxin Li, Chenye Wu, and Adam Wierman

URLhttps://arxiv.org/abs/2102.06548. Chenbei Lu, Zaiwei Chen, Tongxin Li, Chenye Wu, and Adam Wierman. Reinforcement learning with imperfect transition predictions: A bellman-jensen approach,

arXiv

[11] [11]

org/abs/2510.18687

URL https://arxiv. org/abs/2510.18687. Aymen Al Marjani, Aurélien Garivier, and Alexandre Proutiere. Navigating to the best policy in markov decision processes,

arXiv

[12] [12]

Nadav Merlis

URLhttps://arxiv.org/abs/2106.02847. Nadav Merlis. Reinforcement learning with lookahead information,

arXiv

[13] [13]

org/abs/2406.02258

URL https://arxiv. org/abs/2406.02258. Ali Mesbah. Stochastic model predictive control: An overview and perspectives for future research. IEEE Control Systems Magazine, 36(6):30–44,

arXiv

[14] [14]

Stochastic

doi: 10.1109/MCS.2016.2602087. 10 Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles,

work page doi:10.1109/mcs.2016.2602087 2016

[15] [15]

URLhttps://arxiv.org/abs/1910. 10597. Ashwin Pananjady and Martin J. Wainwright. Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning,

1910

[16] [16]

Corentin Pla, Hugo Richard, Marc Abeille, Nadav Merlis, and Vianney Perchet

URLhttps://arxiv.org/abs/1909.08749. Corentin Pla, Hugo Richard, Marc Abeille, Nadav Merlis, and Vianney Perchet. On the hardness of reinforcement learning with transition look-ahead,

arXiv 1909

[17] [17]

doi: 10.1287/inte.2020.1047

ISSN 0092-2102. doi: 10.1287/inte.2020.1047. URL https://doi.org/10. 1287/inte.2020.1047. Aaron Sidford, Mengdi Wang, Xian Wu, Lin F. Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving discounted markov decision process with a generative model,

work page doi:10.1287/inte.2020.1047 2020

[18] [18]

URL https://arxiv.org/abs/1806.01492. Sean R. Sinclair, Felipe Frujeri, Ching-An Cheng, Luke Marshall, Hugo Barbalho, Jingling Li, Jennifer Neville, Ishai Menache, and Adith Swaminathan. Hindsight learning for mdps with exogenous inputs,

Pith/arXiv arXiv

[19] [19]

George Trimponias and Thomas G

URLhttps://arxiv.org/abs/2207.06272. George Trimponias and Thomas G. Dietterich. Reinforcement learning with exogenous states and rewards,

arXiv

[20] [20]

Martin J

URLhttps://arxiv.org/abs/2303.12957. Martin J. Wainwright. Variance-reduced q-learning is minimax optimal,

arXiv

[21] [21]

Jia Wan, Sean R

URL https:// arxiv.org/abs/1906.04697. Jia Wan, Sean R. Sinclair, Devavrat Shah, and Martin J. Wainwright. Exploiting exogenous struc- ture for sample-efficient reinforcement learning,

arXiv 1906

[22] [22]

URL https://arxiv.org/abs/2409. 14557. Hanzhao Wang, Kalyan Talluri, and Xiaocheng Li. Technical note—on dynamic pricing with covariates.Operations Research, 73(4):1932–1943,

1932

[23] [23]

Under π and initial state X0 =x , the process ((Xt, Zt, At))t≥0 evolves as Zt i.i.d

17 Notation Fix a deterministic context-aware policy π:X × Z → A (the stochastic case is identical, with an additional integration over a). Under π and initial state X0 =x , the process ((Xt, Zt, At))t≥0 evolves as Zt i.i.d. ∼µ, A t =π(X t, Zt), X t+1 ∼P(· |X t, At, Zt),(24) withX 0 =x. We writeE x andVar x for expectation and variance under this law. Rec...

2019

[24] [24]

Corollary 2(Bound on the complexity functional).For every context-aware policyπ, V(P, µ, π) := (I−γ ¯P π)−1√ vπ ∞ ≤ r1 +γ 1−γ ·β≤ √ 2β 3/2.(37) Proof

Sincex∈ Xis arbitrary, this concludes the proof. Corollary 2(Bound on the complexity functional).For every context-aware policyπ, V(P, µ, π) := (I−γ ¯P π)−1√ vπ ∞ ≤ r1 +γ 1−γ ·β≤ √ 2β 3/2.(37) Proof. The matrix ¯P π is row-stochastic, hence ∥ ¯P π∥ℓ∞→ℓ∞ ≤1 . Applying Lemma C.3 of Sidford et al. [2019], (I−γ ¯P π)−1√ vπ ∞ ≤ r1 +γ 1−γ (I−γ 2 ¯P π)−1vπ 1/2 ∞...

2019

[25] [25]

The same algorithm serves both BVE and BPE. In both cases the relevant operator is the optimality operator (T¯v)(x) :=E Z∼µ h max a∈A n r(x, a, Z) +γ X x′ P(x ′ |x, a, Z) ¯v(x′) oi ,(40) and HALFERRestimates its fixed point ¯V ⋆. The BVE guarantee is read off directly from the output ˆ¯V ⋆; the BPE guarantee follows by greedy extraction at no additional s...

2019