Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model
Pith reviewed 2026-05-10 09:19 UTC · model grok-4.3
The pith
There exists a worst-case stochastic shortest path instance where any algorithm needs at least order S A B cubed over minimum cost epsilon squared samples to learn an epsilon-optimal policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that there exists a worst-case SSP instance with S states, A actions, minimum cost c_min, and maximum expected cost of the optimal policy over all states B_*, where any algorithm requires at least Omega(S A B_* ^3 / (c_min epsilon^2)) samples to return an epsilon-optimal policy with high probability. Surprisingly, this implies that whenever c_min = 0 an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when c_min = 0,
What carries the argument
The worst-case SSP instance parameterized by number of states S, actions A, minimum cost c_min, and maximum expected optimal cost B_*. It forces any algorithm to collect many samples to distinguish the optimal policy.
If this is right
- An algorithm exists that matches the lower bound up to log factors in the general case with positive minimum cost.
- When minimum cost is zero, a matching algorithm works only if the optimal policy has bounded hitting time to the goal.
- Learning in SSP is strictly harder than in finite-horizon and discounted settings.
- The cubic dependence on B_* highlights the extra difficulty of long paths to the goal.
Where Pith is reading between the lines
- This suggests that reinforcement learning for undiscounted goal-reaching tasks requires explicit safeguards against long or infinite trajectories.
- The results point to the need for new techniques when costs can reach zero, beyond standard methods for discounted problems.
- Similar lower bounds may apply to other continuing-task settings in reinforcement learning.
Load-bearing premise
The optimal policy has bounded expected hitting time to the goal when the minimum cost is zero, without which the problem may be unlearnable.
What would settle it
Constructing an algorithm that uses asymptotically fewer than Omega(S A B_* ^3 / (c_min epsilon^2)) samples to achieve epsilon-optimality with high probability on the worst-case instance would falsify the lower bound.
read the original abstract
We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min} = 0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when $c_{\min} = 0$, but only under the condition that the optimal policy has a bounded hitting time to the goal state.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the sample complexity of learning an ε-optimal policy in the Stochastic Shortest Path (SSP) problem with access to a generative model. It derives a lower bound showing that there exists a worst-case SSP instance requiring Ω(S A B_*³ / (c_min ε²)) samples, implying potential unlearnability when c_min=0. It complements this with an algorithm matching the bound (up to logs) in the general case and a second algorithm that matches it when c_min=0 under the additional assumption that the optimal policy has bounded hitting time to the goal.
Significance. If the bounds hold, the work establishes that SSP learning is strictly harder than finite-horizon or discounted MDP settings due to the divergence of the lower bound as c_min approaches 0. The matching upper and lower bounds, along with the explicit condition for the c_min=0 case, provide a precise characterization of learnability in SSP and highlight the role of minimum cost and hitting times.
major comments (2)
- [Lower bound theorem] Lower bound section (the theorem stating Ω(S A B_*³ / (c_min ε²))): the construction of the hard instance must be expanded to explicitly derive the cubic dependence on B_* and show how variance terms in the cost distributions are controlled to achieve the stated sample lower bound; without this, the tightness claim cannot be verified against the abstract's description.
- [Algorithm for c_min=0] Upper bound for c_min=0 (the second algorithm and its analysis): the proof relies on the bounded expected hitting time assumption, but the handling of variance terms and the precise sample complexity (including how the bound remains finite) is only sketched; this assumption is load-bearing for the claim that the problem is learnable under the condition, and its necessity requires a more detailed argument.
minor comments (2)
- [Abstract] The abstract states that the algorithms match the lower bound 'up to logarithmic factors' but does not specify the precise logarithmic dependence on S, A, B_*, or 1/ε; this should be stated explicitly in the theorem statements.
- [Notation and preliminaries] Notation for B_* (maximum expected cost of the optimal policy over all states) and c_min should be consistently defined and cross-referenced in all theorem statements and algorithm descriptions.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments. We are pleased that the referee recognizes the significance of our results establishing the sample complexity of SSP with a generative model. We address the major comments below.
read point-by-point responses
-
Referee: [Lower bound theorem] Lower bound section (the theorem stating Ω(S A B_*³ / (c_min ε²))): the construction of the hard instance must be expanded to explicitly derive the cubic dependence on B_* and show how variance terms in the cost distributions are controlled to achieve the stated sample lower bound; without this, the tightness claim cannot be verified against the abstract's description.
Authors: We agree that the lower bound construction would benefit from expanded details to make the cubic dependence on B_* and variance control fully explicit. In the revised manuscript, we will augment the proof of the lower bound theorem with a step-by-step derivation of the hard instance, including the specific cost distributions chosen and how their variances are bounded so that they do not affect the Ω(B_*³) term in the sample complexity. This will allow direct verification of the claimed tightness. revision: yes
-
Referee: [Algorithm for c_min=0] Upper bound for c_min=0 (the second algorithm and its analysis): the proof relies on the bounded expected hitting time assumption, but the handling of variance terms and the precise sample complexity (including how the bound remains finite) is only sketched; this assumption is load-bearing for the claim that the problem is learnable under the condition, and its necessity requires a more detailed argument.
Authors: We acknowledge that the analysis for the c_min=0 case under bounded hitting time is currently sketched. In the revision we will supply a complete proof that explicitly handles variance terms in the cost estimates, derives the precise (finite) sample complexity, and shows how the bounded hitting time assumption ensures the bound remains finite while matching the lower bound up to logarithmic factors. We will also clarify the necessity of this assumption by directly linking it to the lower bound, which already demonstrates that SSP may be unlearnable when c_min=0 without additional conditions. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper constructs an explicit worst-case SSP instance to establish the Omega(S A B_* ^3 / (c_min epsilon^2)) lower bound and then provides separate algorithmic upper bounds that match it (up to logs) in the general case and under the additional bounded-hitting-time condition when c_min=0. These steps rely on standard minimax analysis and algorithm design against external problem instances rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation. The derivation remains self-contained against the generative-model SSP setting with no equations or claims that collapse to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Finite state and action spaces
- domain assumption Existence of at least one proper policy (finite expected cost to goal)
Reference graph
Works this paper leans on
-
[1]
Model-based reinforcement learning with a generative model is minimax optimal
Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67--83. PMLR, 2020
work page 2020
-
[2]
Minimax pac bounds on the sample complexity of reinforcement learning with a generative model
Mohammad Gheshlaghi Azar, R \'e mi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91 0 (3): 0 325--349, 2013
work page 2013
-
[3]
Minimax regret bounds for reinforcement learning
Mohammad Gheshlaghi Azar, Ian Osband, and R \'e mi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263--272. JMLR. org, 2017
work page 2017
-
[4]
Dynamic programming and optimal control, volume 1
Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995
work page 1995
-
[5]
An analysis of stochastic shortest path problems
Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16 0 (3): 0 580--595, 1991
work page 1991
-
[6]
Stochastic shortest path problems under weak conditions
Dimitri P Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013
work page 2013
-
[7]
Improved analysis of UCRL2 with empirical bernstein inequality
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2 with empirical bernstein inequality. arXiv preprint arXiv:2007.05456, 2020
-
[8]
Chi Jin, Zeyuan Allen - Zhu, S \' e bastien Bubeck, and Michael I. Jordan. Is q-learning provably efficient? In NeurIPS, pages 4868--4878, 2018
work page 2018
-
[9]
Learning Adversarial MDPs with Bandit Feedback and Unknown Transition
Chi Jin , Tiancheng Jin , Haipeng Luo , Suvrit Sra , and Tiancheng Yu . Learning Adversarial MDPs with Bandit Feedback and Unknown Transition . In International Conference on Machine Learning, pages 4860--4869. PMLR, 2020
work page 2020
-
[10]
Conservative contextual linear bandits
Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi, and Benjamin Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910--3919, 2017
work page 2017
-
[11]
Near-optimal reinforcement learning in polynomial time
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49 0 (2-3): 0 209--232, 2002
work page 2002
-
[12]
Breaking the sample size barrier in model-based reinforcement learning with a generative model
Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. In Advances in Neural Information Processing Systems, volume 33, pages 12861--12872, 2020
work page 2020
-
[13]
Near-optimal regret bounds for stochastic shortest path
Aviv Rosenberg, Alon Cohen, Yishay Mansour, and Haim Kaplan. Near-optimal regret bounds for stochastic shortest path. In International Conference on Machine Learning, pages 8210--8219. PMLR, 2020
work page 2020
-
[14]
Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186--5196, 2018 a
work page 2018
-
[15]
Variance reduced value iteration and faster algorithms for solving markov decision processes
Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770--787. SIAM, 2018 b
work page 2018
-
[16]
No-regret exploration in goal-oriented reinforcement learning
Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pages 9428--9437. PMLR, 2020 a
work page 2020
-
[17]
Improved sample complexity for incremental autonomous exploration in mdps
Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Improved sample complexity for incremental autonomous exploration in mdps. In Advances in Neural Information Processing Systems, volume 33, pages 11273--11284, 2020 b
work page 2020
-
[18]
Mengdi Wang. Randomized linear programming solves the discounted markov decision problem in nearly-linear running time. arXiv preprint arXiv:1704.01869, 2017
-
[19]
Almost horizon-free structure-aware best policy identification with a generative model
Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In Advances in Neural Information Processing Systems, pages 5626--5635, 2019
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.