Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model

Alessandro Lazaric; Jean Tarbouriech; Matteo Pirotta; Michal Valko

arxiv: 2604.16111 · v1 · submitted 2026-04-17 · 💻 cs.LG · stat.ML

Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model

Jean Tarbouriech , Matteo Pirotta , Michal Valko , Alessandro Lazaric This is my paper

Pith reviewed 2026-05-10 09:19 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords stochastic shortest pathsample complexitygenerative modelreinforcement learninglower boundepsilon-optimal policyhitting time

0 comments

The pith

There exists a worst-case stochastic shortest path instance where any algorithm needs at least order S A B cubed over minimum cost epsilon squared samples to learn an epsilon-optimal policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a lower bound on the number of samples needed to learn an epsilon-optimal policy in the stochastic shortest path problem when a generative model is available. It shows that this lower bound scales with the cube of the maximum expected cost of the optimal policy divided by the minimum cost per step. This reveals that the problem can be unlearnable when the minimum cost is zero, unlike in finite-horizon or discounted settings. The authors also provide algorithms that achieve matching upper bounds under appropriate conditions.

Core claim

We show that there exists a worst-case SSP instance with S states, A actions, minimum cost c_min, and maximum expected cost of the optimal policy over all states B_*, where any algorithm requires at least Omega(S A B_* ^3 / (c_min epsilon^2)) samples to return an epsilon-optimal policy with high probability. Surprisingly, this implies that whenever c_min = 0 an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when c_min = 0,

What carries the argument

The worst-case SSP instance parameterized by number of states S, actions A, minimum cost c_min, and maximum expected optimal cost B_*. It forces any algorithm to collect many samples to distinguish the optimal policy.

If this is right

An algorithm exists that matches the lower bound up to log factors in the general case with positive minimum cost.
When minimum cost is zero, a matching algorithm works only if the optimal policy has bounded hitting time to the goal.
Learning in SSP is strictly harder than in finite-horizon and discounted settings.
The cubic dependence on B_* highlights the extra difficulty of long paths to the goal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that reinforcement learning for undiscounted goal-reaching tasks requires explicit safeguards against long or infinite trajectories.
The results point to the need for new techniques when costs can reach zero, beyond standard methods for discounted problems.
Similar lower bounds may apply to other continuing-task settings in reinforcement learning.

Load-bearing premise

The optimal policy has bounded expected hitting time to the goal when the minimum cost is zero, without which the problem may be unlearnable.

What would settle it

Constructing an algorithm that uses asymptotically fewer than Omega(S A B_* ^3 / (c_min epsilon^2)) samples to achieve epsilon-optimality with high probability on the worst-case instance would falsify the lower bound.

read the original abstract

We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min} = 0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when $c_{\min} = 0$, but only under the condition that the optimal policy has a bounded hitting time to the goal state.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a tight Omega(S A B_star^3 / (c_min epsilon^2)) lower bound for SSP sample complexity with a generative model, matched by algorithms, and shows unlearnability when c_min hits zero without extra assumptions.

read the letter

The key point is that the authors establish matching upper and lower bounds for the sample complexity of epsilon-optimal policy learning in Stochastic Shortest Path with a generative model. The lower bound scales as Omega of S A B star to the third over c min epsilon squared, and this leads to the surprising result that the problem may not be learnable at all if c min is zero. This is new compared to prior bounds in finite-horizon and discounted cases, as the abstract notes. The work does well in constructing a hard instance that forces this dependence and then providing algorithms that achieve nearly the same rate in the general case, plus a variant for the zero min cost scenario when the optimal policy has bounded hitting time to the goal. The bounds appear well-supported, with the lower bound coming from a carefully chosen worst-case SSP instance and the upper bounds from efficient use of the generative model to estimate the necessary quantities. There is no sign of circular reasoning or unverified assumptions beyond the stated ones. One area that could use more attention is the practical relevance of the bounded hitting time condition for the c min zero algorithm. While the abstract sketches why it's necessary, seeing how it affects common SSP applications would strengthen the paper. The handling of logarithmic factors and high-probability guarantees seems standard but should be verified in the proofs. This paper targets theorists interested in the sample complexity of goal-oriented reinforcement learning. Readers working on planning with generative models or comparing different MDP settings will find the qualitative gap useful. I recommend sending it for peer review. The central result is interesting enough and the evidence from the bounds is sharp enough to warrant referee input.

Referee Report

2 major / 2 minor

Summary. The paper studies the sample complexity of learning an ε-optimal policy in the Stochastic Shortest Path (SSP) problem with access to a generative model. It derives a lower bound showing that there exists a worst-case SSP instance requiring Ω(S A B_*³ / (c_min ε²)) samples, implying potential unlearnability when c_min=0. It complements this with an algorithm matching the bound (up to logs) in the general case and a second algorithm that matches it when c_min=0 under the additional assumption that the optimal policy has bounded hitting time to the goal.

Significance. If the bounds hold, the work establishes that SSP learning is strictly harder than finite-horizon or discounted MDP settings due to the divergence of the lower bound as c_min approaches 0. The matching upper and lower bounds, along with the explicit condition for the c_min=0 case, provide a precise characterization of learnability in SSP and highlight the role of minimum cost and hitting times.

major comments (2)

[Lower bound theorem] Lower bound section (the theorem stating Ω(S A B_*³ / (c_min ε²))): the construction of the hard instance must be expanded to explicitly derive the cubic dependence on B_* and show how variance terms in the cost distributions are controlled to achieve the stated sample lower bound; without this, the tightness claim cannot be verified against the abstract's description.
[Algorithm for c_min=0] Upper bound for c_min=0 (the second algorithm and its analysis): the proof relies on the bounded expected hitting time assumption, but the handling of variance terms and the precise sample complexity (including how the bound remains finite) is only sketched; this assumption is load-bearing for the claim that the problem is learnable under the condition, and its necessity requires a more detailed argument.

minor comments (2)

[Abstract] The abstract states that the algorithms match the lower bound 'up to logarithmic factors' but does not specify the precise logarithmic dependence on S, A, B_*, or 1/ε; this should be stated explicitly in the theorem statements.
[Notation and preliminaries] Notation for B_* (maximum expected cost of the optimal policy over all states) and c_min should be consistently defined and cross-referenced in all theorem statements and algorithm descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We are pleased that the referee recognizes the significance of our results establishing the sample complexity of SSP with a generative model. We address the major comments below.

read point-by-point responses

Referee: [Lower bound theorem] Lower bound section (the theorem stating Ω(S A B_*³ / (c_min ε²))): the construction of the hard instance must be expanded to explicitly derive the cubic dependence on B_* and show how variance terms in the cost distributions are controlled to achieve the stated sample lower bound; without this, the tightness claim cannot be verified against the abstract's description.

Authors: We agree that the lower bound construction would benefit from expanded details to make the cubic dependence on B_* and variance control fully explicit. In the revised manuscript, we will augment the proof of the lower bound theorem with a step-by-step derivation of the hard instance, including the specific cost distributions chosen and how their variances are bounded so that they do not affect the Ω(B_*³) term in the sample complexity. This will allow direct verification of the claimed tightness. revision: yes
Referee: [Algorithm for c_min=0] Upper bound for c_min=0 (the second algorithm and its analysis): the proof relies on the bounded expected hitting time assumption, but the handling of variance terms and the precise sample complexity (including how the bound remains finite) is only sketched; this assumption is load-bearing for the claim that the problem is learnable under the condition, and its necessity requires a more detailed argument.

Authors: We acknowledge that the analysis for the c_min=0 case under bounded hitting time is currently sketched. In the revision we will supply a complete proof that explicitly handles variance terms in the cost estimates, derives the precise (finite) sample complexity, and shows how the bounded hitting time assumption ensures the bound remains finite while matching the lower bound up to logarithmic factors. We will also clarify the necessity of this assumption by directly linking it to the lower bound, which already demonstrates that SSP may be unlearnable when c_min=0 without additional conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper constructs an explicit worst-case SSP instance to establish the Omega(S A B_* ^3 / (c_min epsilon^2)) lower bound and then provides separate algorithmic upper bounds that match it (up to logs) in the general case and under the additional bounded-hitting-time condition when c_min=0. These steps rely on standard minimax analysis and algorithm design against external problem instances rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation. The derivation remains self-contained against the generative-model SSP setting with no equations or claims that collapse to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Analysis rests on standard finite-state-action SSP assumptions plus the bounded-hitting-time condition for the zero-cost case; no new free parameters or invented entities are introduced.

axioms (2)

standard math Finite state and action spaces
Required for any sample-complexity statement in tabular RL; implicit in the S and A factors.
domain assumption Existence of at least one proper policy (finite expected cost to goal)
Standard SSP assumption needed for the problem to be well-posed; the bounded-hitting-time condition strengthens it for the c_min=0 algorithm.

pith-pipeline@v0.9.0 · 5506 in / 1420 out tokens · 53462 ms · 2026-05-10T09:19:39.489011+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1]

Model-based reinforcement learning with a generative model is minimax optimal

Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67--83. PMLR, 2020

work page 2020
[2]

Minimax pac bounds on the sample complexity of reinforcement learning with a generative model

Mohammad Gheshlaghi Azar, R \'e mi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91 0 (3): 0 325--349, 2013

work page 2013
[3]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and R \'e mi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263--272. JMLR. org, 2017

work page 2017
[4]

Dynamic programming and optimal control, volume 1

Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995

work page 1995
[5]

An analysis of stochastic shortest path problems

Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16 0 (3): 0 580--595, 1991

work page 1991
[6]

Stochastic shortest path problems under weak conditions

Dimitri P Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013

work page 2013
[7]

Improved analysis of UCRL2 with empirical bernstein inequality

Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2 with empirical bernstein inequality. arXiv preprint arXiv:2007.05456, 2020

work page arXiv 2007
[8]

Chi Jin, Zeyuan Allen - Zhu, S \' e bastien Bubeck, and Michael I. Jordan. Is q-learning provably efficient? In NeurIPS, pages 4868--4878, 2018

work page 2018
[9]

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Chi Jin , Tiancheng Jin , Haipeng Luo , Suvrit Sra , and Tiancheng Yu . Learning Adversarial MDPs with Bandit Feedback and Unknown Transition . In International Conference on Machine Learning, pages 4860--4869. PMLR, 2020

work page 2020
[10]

Conservative contextual linear bandits

Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi, and Benjamin Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910--3919, 2017

work page 2017
[11]

Near-optimal reinforcement learning in polynomial time

Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49 0 (2-3): 0 209--232, 2002

work page 2002
[12]

Breaking the sample size barrier in model-based reinforcement learning with a generative model

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. In Advances in Neural Information Processing Systems, volume 33, pages 12861--12872, 2020

work page 2020
[13]

Near-optimal regret bounds for stochastic shortest path

Aviv Rosenberg, Alon Cohen, Yishay Mansour, and Haim Kaplan. Near-optimal regret bounds for stochastic shortest path. In International Conference on Machine Learning, pages 8210--8219. PMLR, 2020

work page 2020
[14]

Near-optimal time and sample complexities for solving markov decision processes with a generative model

Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186--5196, 2018 a

work page 2018
[15]

Variance reduced value iteration and faster algorithms for solving markov decision processes

Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770--787. SIAM, 2018 b

work page 2018
[16]

No-regret exploration in goal-oriented reinforcement learning

Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pages 9428--9437. PMLR, 2020 a

work page 2020
[17]

Improved sample complexity for incremental autonomous exploration in mdps

Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Improved sample complexity for incremental autonomous exploration in mdps. In Advances in Neural Information Processing Systems, volume 33, pages 11273--11284, 2020 b

work page 2020
[18]

Randomized linear programming solves the discounted markov decision problem in nearly-linear running time

Mengdi Wang. Randomized linear programming solves the discounted markov decision problem in nearly-linear running time. arXiv preprint arXiv:1704.01869, 2017

work page arXiv 2017
[19]

Almost horizon-free structure-aware best policy identification with a generative model

Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In Advances in Neural Information Processing Systems, pages 5626--5635, 2019

work page 2019

[1] [1]

Model-based reinforcement learning with a generative model is minimax optimal

Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67--83. PMLR, 2020

work page 2020

[2] [2]

Minimax pac bounds on the sample complexity of reinforcement learning with a generative model

Mohammad Gheshlaghi Azar, R \'e mi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91 0 (3): 0 325--349, 2013

work page 2013

[3] [3]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and R \'e mi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263--272. JMLR. org, 2017

work page 2017

[4] [4]

Dynamic programming and optimal control, volume 1

Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995

work page 1995

[5] [5]

An analysis of stochastic shortest path problems

Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16 0 (3): 0 580--595, 1991

work page 1991

[6] [6]

Stochastic shortest path problems under weak conditions

Dimitri P Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013

work page 2013

[7] [7]

Improved analysis of UCRL2 with empirical bernstein inequality

Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2 with empirical bernstein inequality. arXiv preprint arXiv:2007.05456, 2020

work page arXiv 2007

[8] [8]

Chi Jin, Zeyuan Allen - Zhu, S \' e bastien Bubeck, and Michael I. Jordan. Is q-learning provably efficient? In NeurIPS, pages 4868--4878, 2018

work page 2018

[9] [9]

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Chi Jin , Tiancheng Jin , Haipeng Luo , Suvrit Sra , and Tiancheng Yu . Learning Adversarial MDPs with Bandit Feedback and Unknown Transition . In International Conference on Machine Learning, pages 4860--4869. PMLR, 2020

work page 2020

[10] [10]

Conservative contextual linear bandits

Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi, and Benjamin Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910--3919, 2017

work page 2017

[11] [11]

Near-optimal reinforcement learning in polynomial time

Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49 0 (2-3): 0 209--232, 2002

work page 2002

[12] [12]

Breaking the sample size barrier in model-based reinforcement learning with a generative model

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. In Advances in Neural Information Processing Systems, volume 33, pages 12861--12872, 2020

work page 2020

[13] [13]

Near-optimal regret bounds for stochastic shortest path

Aviv Rosenberg, Alon Cohen, Yishay Mansour, and Haim Kaplan. Near-optimal regret bounds for stochastic shortest path. In International Conference on Machine Learning, pages 8210--8219. PMLR, 2020

work page 2020

[14] [14]

Near-optimal time and sample complexities for solving markov decision processes with a generative model

Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186--5196, 2018 a

work page 2018

[15] [15]

Variance reduced value iteration and faster algorithms for solving markov decision processes

Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770--787. SIAM, 2018 b

work page 2018

[16] [16]

No-regret exploration in goal-oriented reinforcement learning

Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pages 9428--9437. PMLR, 2020 a

work page 2020

[17] [17]

Improved sample complexity for incremental autonomous exploration in mdps

Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Improved sample complexity for incremental autonomous exploration in mdps. In Advances in Neural Information Processing Systems, volume 33, pages 11273--11284, 2020 b

work page 2020

[18] [18]

Randomized linear programming solves the discounted markov decision problem in nearly-linear running time

Mengdi Wang. Randomized linear programming solves the discounted markov decision problem in nearly-linear running time. arXiv preprint arXiv:1704.01869, 2017

work page arXiv 2017

[19] [19]

Almost horizon-free structure-aware best policy identification with a generative model

Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In Advances in Neural Information Processing Systems, pages 5626--5635, 2019

work page 2019