pith. sign in

arxiv: 2604.16111 · v1 · submitted 2026-04-17 · 💻 cs.LG · stat.ML

Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model

Pith reviewed 2026-05-10 09:19 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords stochastic shortest pathsample complexitygenerative modelreinforcement learninglower boundepsilon-optimal policyhitting time
0
0 comments X

The pith

There exists a worst-case stochastic shortest path instance where any algorithm needs at least order S A B cubed over minimum cost epsilon squared samples to learn an epsilon-optimal policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a lower bound on the number of samples needed to learn an epsilon-optimal policy in the stochastic shortest path problem when a generative model is available. It shows that this lower bound scales with the cube of the maximum expected cost of the optimal policy divided by the minimum cost per step. This reveals that the problem can be unlearnable when the minimum cost is zero, unlike in finite-horizon or discounted settings. The authors also provide algorithms that achieve matching upper bounds under appropriate conditions.

Core claim

We show that there exists a worst-case SSP instance with S states, A actions, minimum cost c_min, and maximum expected cost of the optimal policy over all states B_*, where any algorithm requires at least Omega(S A B_* ^3 / (c_min epsilon^2)) samples to return an epsilon-optimal policy with high probability. Surprisingly, this implies that whenever c_min = 0 an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when c_min = 0,

What carries the argument

The worst-case SSP instance parameterized by number of states S, actions A, minimum cost c_min, and maximum expected optimal cost B_*. It forces any algorithm to collect many samples to distinguish the optimal policy.

If this is right

  • An algorithm exists that matches the lower bound up to log factors in the general case with positive minimum cost.
  • When minimum cost is zero, a matching algorithm works only if the optimal policy has bounded hitting time to the goal.
  • Learning in SSP is strictly harder than in finite-horizon and discounted settings.
  • The cubic dependence on B_* highlights the extra difficulty of long paths to the goal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that reinforcement learning for undiscounted goal-reaching tasks requires explicit safeguards against long or infinite trajectories.
  • The results point to the need for new techniques when costs can reach zero, beyond standard methods for discounted problems.
  • Similar lower bounds may apply to other continuing-task settings in reinforcement learning.

Load-bearing premise

The optimal policy has bounded expected hitting time to the goal when the minimum cost is zero, without which the problem may be unlearnable.

What would settle it

Constructing an algorithm that uses asymptotically fewer than Omega(S A B_* ^3 / (c_min epsilon^2)) samples to achieve epsilon-optimality with high probability on the worst-case instance would falsify the lower bound.

read the original abstract

We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min} = 0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when $c_{\min} = 0$, but only under the condition that the optimal policy has a bounded hitting time to the goal state.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies the sample complexity of learning an ε-optimal policy in the Stochastic Shortest Path (SSP) problem with access to a generative model. It derives a lower bound showing that there exists a worst-case SSP instance requiring Ω(S A B_*³ / (c_min ε²)) samples, implying potential unlearnability when c_min=0. It complements this with an algorithm matching the bound (up to logs) in the general case and a second algorithm that matches it when c_min=0 under the additional assumption that the optimal policy has bounded hitting time to the goal.

Significance. If the bounds hold, the work establishes that SSP learning is strictly harder than finite-horizon or discounted MDP settings due to the divergence of the lower bound as c_min approaches 0. The matching upper and lower bounds, along with the explicit condition for the c_min=0 case, provide a precise characterization of learnability in SSP and highlight the role of minimum cost and hitting times.

major comments (2)
  1. [Lower bound theorem] Lower bound section (the theorem stating Ω(S A B_*³ / (c_min ε²))): the construction of the hard instance must be expanded to explicitly derive the cubic dependence on B_* and show how variance terms in the cost distributions are controlled to achieve the stated sample lower bound; without this, the tightness claim cannot be verified against the abstract's description.
  2. [Algorithm for c_min=0] Upper bound for c_min=0 (the second algorithm and its analysis): the proof relies on the bounded expected hitting time assumption, but the handling of variance terms and the precise sample complexity (including how the bound remains finite) is only sketched; this assumption is load-bearing for the claim that the problem is learnable under the condition, and its necessity requires a more detailed argument.
minor comments (2)
  1. [Abstract] The abstract states that the algorithms match the lower bound 'up to logarithmic factors' but does not specify the precise logarithmic dependence on S, A, B_*, or 1/ε; this should be stated explicitly in the theorem statements.
  2. [Notation and preliminaries] Notation for B_* (maximum expected cost of the optimal policy over all states) and c_min should be consistently defined and cross-referenced in all theorem statements and algorithm descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We are pleased that the referee recognizes the significance of our results establishing the sample complexity of SSP with a generative model. We address the major comments below.

read point-by-point responses
  1. Referee: [Lower bound theorem] Lower bound section (the theorem stating Ω(S A B_*³ / (c_min ε²))): the construction of the hard instance must be expanded to explicitly derive the cubic dependence on B_* and show how variance terms in the cost distributions are controlled to achieve the stated sample lower bound; without this, the tightness claim cannot be verified against the abstract's description.

    Authors: We agree that the lower bound construction would benefit from expanded details to make the cubic dependence on B_* and variance control fully explicit. In the revised manuscript, we will augment the proof of the lower bound theorem with a step-by-step derivation of the hard instance, including the specific cost distributions chosen and how their variances are bounded so that they do not affect the Ω(B_*³) term in the sample complexity. This will allow direct verification of the claimed tightness. revision: yes

  2. Referee: [Algorithm for c_min=0] Upper bound for c_min=0 (the second algorithm and its analysis): the proof relies on the bounded expected hitting time assumption, but the handling of variance terms and the precise sample complexity (including how the bound remains finite) is only sketched; this assumption is load-bearing for the claim that the problem is learnable under the condition, and its necessity requires a more detailed argument.

    Authors: We acknowledge that the analysis for the c_min=0 case under bounded hitting time is currently sketched. In the revision we will supply a complete proof that explicitly handles variance terms in the cost estimates, derives the precise (finite) sample complexity, and shows how the bounded hitting time assumption ensures the bound remains finite while matching the lower bound up to logarithmic factors. We will also clarify the necessity of this assumption by directly linking it to the lower bound, which already demonstrates that SSP may be unlearnable when c_min=0 without additional conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper constructs an explicit worst-case SSP instance to establish the Omega(S A B_* ^3 / (c_min epsilon^2)) lower bound and then provides separate algorithmic upper bounds that match it (up to logs) in the general case and under the additional bounded-hitting-time condition when c_min=0. These steps rely on standard minimax analysis and algorithm design against external problem instances rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation. The derivation remains self-contained against the generative-model SSP setting with no equations or claims that collapse to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Analysis rests on standard finite-state-action SSP assumptions plus the bounded-hitting-time condition for the zero-cost case; no new free parameters or invented entities are introduced.

axioms (2)
  • standard math Finite state and action spaces
    Required for any sample-complexity statement in tabular RL; implicit in the S and A factors.
  • domain assumption Existence of at least one proper policy (finite expected cost to goal)
    Standard SSP assumption needed for the problem to be well-posed; the bounded-hitting-time condition strengthens it for the c_min=0 algorithm.

pith-pipeline@v0.9.0 · 5506 in / 1420 out tokens · 53462 ms · 2026-05-10T09:19:39.489011+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Model-based reinforcement learning with a generative model is minimax optimal

    Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67--83. PMLR, 2020

  2. [2]

    Minimax pac bounds on the sample complexity of reinforcement learning with a generative model

    Mohammad Gheshlaghi Azar, R \'e mi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91 0 (3): 0 325--349, 2013

  3. [3]

    Minimax regret bounds for reinforcement learning

    Mohammad Gheshlaghi Azar, Ian Osband, and R \'e mi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263--272. JMLR. org, 2017

  4. [4]

    Dynamic programming and optimal control, volume 1

    Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995

  5. [5]

    An analysis of stochastic shortest path problems

    Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16 0 (3): 0 580--595, 1991

  6. [6]

    Stochastic shortest path problems under weak conditions

    Dimitri P Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013

  7. [7]

    Improved analysis of UCRL2 with empirical bernstein inequality

    Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2 with empirical bernstein inequality. arXiv preprint arXiv:2007.05456, 2020

  8. [8]

    Chi Jin, Zeyuan Allen - Zhu, S \' e bastien Bubeck, and Michael I. Jordan. Is q-learning provably efficient? In NeurIPS, pages 4868--4878, 2018

  9. [9]

    Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

    Chi Jin , Tiancheng Jin , Haipeng Luo , Suvrit Sra , and Tiancheng Yu . Learning Adversarial MDPs with Bandit Feedback and Unknown Transition . In International Conference on Machine Learning, pages 4860--4869. PMLR, 2020

  10. [10]

    Conservative contextual linear bandits

    Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi, and Benjamin Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910--3919, 2017

  11. [11]

    Near-optimal reinforcement learning in polynomial time

    Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49 0 (2-3): 0 209--232, 2002

  12. [12]

    Breaking the sample size barrier in model-based reinforcement learning with a generative model

    Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. In Advances in Neural Information Processing Systems, volume 33, pages 12861--12872, 2020

  13. [13]

    Near-optimal regret bounds for stochastic shortest path

    Aviv Rosenberg, Alon Cohen, Yishay Mansour, and Haim Kaplan. Near-optimal regret bounds for stochastic shortest path. In International Conference on Machine Learning, pages 8210--8219. PMLR, 2020

  14. [14]

    Near-optimal time and sample complexities for solving markov decision processes with a generative model

    Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186--5196, 2018 a

  15. [15]

    Variance reduced value iteration and faster algorithms for solving markov decision processes

    Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770--787. SIAM, 2018 b

  16. [16]

    No-regret exploration in goal-oriented reinforcement learning

    Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pages 9428--9437. PMLR, 2020 a

  17. [17]

    Improved sample complexity for incremental autonomous exploration in mdps

    Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Improved sample complexity for incremental autonomous exploration in mdps. In Advances in Neural Information Processing Systems, volume 33, pages 11273--11284, 2020 b

  18. [18]

    Randomized linear programming solves the discounted markov decision problem in nearly-linear running time

    Mengdi Wang. Randomized linear programming solves the discounted markov decision problem in nearly-linear running time. arXiv preprint arXiv:1704.01869, 2017

  19. [19]

    Almost horizon-free structure-aware best policy identification with a generative model

    Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In Advances in Neural Information Processing Systems, pages 5626--5635, 2019