Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

Lucas Lehnert; Nasehatul Mustakim

arxiv: 2605.20272 · v1 · pith:5LOGV7CSnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

Nasehatul Mustakim , Lucas Lehnert This is my paper

Pith reviewed 2026-05-21 07:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reinforcement learningstate abstractiongeneralizationPOMDPmodel reductionout-of-distribution

0 comments

The pith

Constraining RL agents to small finite abstract state sets enables generalization to more complex POMDP tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends state abstraction methods from MDPs to POMDPs and introduces a successor-weighted model reduction that compresses experiences into strictly smaller abstract spaces than earlier definitions permit. It then derives a performance bound that splits an agent's out-of-distribution loss into approximation error and estimation error, proving that the total loss shrinks when the number of abstract states is reduced. A sympathetic reader cares because the result supplies the first explicit conditions under which an RL agent can transfer successfully from simple training tasks to larger, unseen ones. The analysis therefore concludes that limiting the agent to a small, finite collection of abstract states is required for cross-scale generalization.

Core claim

By extending the state abstraction framework to POMDPs and defining successor-weighted model reduction, the paper obtains a bound on OOD test performance whose approximation and estimation error terms both decrease as abstract state space size is reduced. This establishes that agents restricted to smaller abstract spaces achieve better generalization to more complex tasks, and that such a restriction is necessary for scaling.

What carries the argument

Successor-weighted model reduction, a variant of model reduction that produces strictly smaller abstract state spaces in POMDPs while preserving the performance bound.

If this is right

Reducing abstract state space size lowers both approximation and estimation errors in the derived OOD performance bound.
Agents restricted to fewer abstract states transfer more reliably from small training tasks to larger test tasks.
Architectures that enforce compact abstract representations will exhibit better scaling across task sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RL training loops could add explicit penalties on abstract state space size to encourage the emergence of generalizing policies.
The same compression principle may explain why humans rely on a limited repertoire of abstract concepts when facing novel situations of greater complexity.
Empirical tests in continuous POMDPs with controllable observation noise could directly measure the predicted error reduction.

Load-bearing premise

Extending existing state abstraction proof techniques from MDPs to POMDPs remains valid and the new reduction achieves smaller spaces without loosening the error bounds.

What would settle it

Measure whether RL agents forced to use larger abstract state spaces in scaled-up POMDPs exhibit strictly higher OOD error than agents using the smaller successor-weighted spaces; a reversal of this ordering would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.20272 by Lucas Lehnert, Nasehatul Mustakim.

**Figure 1.** Figure 1: Cross-scale generalization in the warm–cold lattice environment. (a): In the warm–cold lattice environment, the agent can navigate across an infinitely large lattice by moving up, down, left, or right to reach the rewarding goal location. The agent never observes its lattice position and is only given a warm (W) or cold (C) observation depending on it moving closer to or farther from the goal. To find the … view at source ↗

**Figure 2.** Figure 2: Out-of-distribution generalization via changing start-state distributions. Given two start-state distributions d train 0 and d test 0 , a POMDP P can be mapped to the history Markov Decision Processes (MDPs) MH train and MH test with state spaces that consist of the observation-action sequences that are observed since starting in a start state. These history MDPs generate the same reward sequences with equ… view at source ↗

**Figure 3.** Figure 3: Successor-weighted model reductions enable small abstract state spaces. (a): The agent starts out at a state N and then moves left towards the reward while observing first a left arrow (←) and then empty observations (∅) (until the goal state, the terminal state, is entered). This POMDP can be compressed into a two-state abstract MDP: The terminal reward state (green) is modelled as one abstract state wher… view at source ↗

**Figure 4.** Figure 4: The abstract state space size trades off test performance. Both plots visualize the bound presented in Corollary 1. the agent makes arbitrary predictions. Thus, for every abstract state-action pair (sϕ, a), |(rbMϕ train − rMϕ train )(sϕ, a)| = ( ε if sϕ ∈ Sϕ obs Rmax if sϕ ∈ Sϕ unk ∥(Pb Mϕ train − P Mϕ train )(sϕ, a)∥ 1 = ( ε if sϕ ∈ Sϕ obs 2 if sϕ ∈ Sϕ unk. Assumption 2 characterizes learning under a fixe… view at source ↗

**Figure 5.** Figure 5: Reducing the number of abstract states improves OOD generalization. (a): The number of mistakes (number of suboptimally selected actions) when starting the computed policy 50 steps away from the goal. For each history suffix length, each start state was sampled five times and the plot shows a standard box plot. The green curve shows the average number of suboptimally selected actions (average number of mis… view at source ↗

**Figure 6.** Figure 6: Sign-chain task example. (a): During training, the agent either starts at location A or B and immediately observes a sign: either move left (←) or move right (→). At all other locations the agent only observes the same ∅ observation. All transitions are deterministic, and the rewarding state is terminal. During testing, the agent either starts at location C or D. These states are placed at a higher and var… view at source ↗

read the original abstract

While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present the first theoretical model of how such Out-of-Distribution (OOD) generalization can be achieved in RL agents. Our approach considers Partially Observable Markov Decision Processes (POMDPs) and assumes that an intelligent agent uses an abstraction function to determine which experiences can be treated as equivalent and which must be distinguished. First, we extend the existing state abstraction framework and proof techniques to POMDPs. Then, we define a successor-weighted model reduction, a model reduction variant that enables compression into smaller abstract spaces than prior definitions allow. We derive a bound on the agent's OOD test performance, thereby defining the conditions under which OOD generalization is achievable. This bound decomposes an agent's performance loss into approximation and estimation errors, revealing how reducing an agent's abstract state space size improves test performance and OOD generalization. Our analysis suggests that constraining an agent to operate over a small, finite set of abstract states is necessary for achieving generalization to more complex tasks. Our results motivate further research into learning RL architectures that scale across tasks of varying complexity levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a bound tying smaller abstract state spaces to better OOD generalization in POMDPs via a new successor-weighted reduction, but the necessity claim looks shaky if the error decomposition picks up extra terms under partial observability.

read the letter

The main point is that the authors extend state abstraction techniques to POMDPs and introduce successor-weighted model reduction to reach smaller abstract spaces than earlier definitions. They then derive a performance bound that splits loss into approximation and estimation errors, and use it to argue that keeping the abstract state space small and finite is necessary for generalization to larger or more complex tasks.

Referee Report

2 major / 2 minor

Summary. The paper claims to provide the first theoretical model of OOD generalization in RL for POMDPs. It extends existing state abstraction frameworks and proof techniques to POMDPs, introduces a successor-weighted model reduction that compresses into strictly smaller abstract state spaces than prior definitions, and derives an OOD performance bound that decomposes loss into approximation and estimation errors. The analysis concludes that constraining agents to small finite abstract state spaces is necessary for generalization to more complex tasks.

Significance. If the bound derivation is valid and the POMDP extension preserves the required contraction/Lipschitz properties without extra cross-terms, the work supplies a concrete error decomposition linking abstraction size to cross-scale generalization. This would be a useful theoretical contribution, offering falsifiable conditions and motivating architectures that explicitly learn compact abstractions. The successor-weighted reduction and its claimed size advantage over prior methods are potentially valuable if rigorously shown to maintain bound validity.

major comments (2)

[POMDP extension and bound derivation] POMDP extension section (following the abstract's description of extending state abstraction frameworks): the necessity claim that smaller abstract spaces strictly improve OOD generalization rests on the performance bound decomposing cleanly into approximation and estimation errors. It is unclear whether the successor-weighted reduction, when applied to belief states rather than fully observed states, preserves the original MDP contraction or Lipschitz conditions; an additional cross-term scaling with task size could appear in the approximation error and invalidate the 'smaller is better' direction of the bound.
[Successor-weighted model reduction] Definition of successor-weighted model reduction: the claim that this variant yields strictly smaller abstract spaces than prior definitions without introducing inconsistencies requires an explicit comparison (e.g., to bisimulation or homomorphism-based abstractions) showing that the weighting preserves the error decomposition used in the OOD bound. Without this, the necessity conclusion for small finite spaces does not follow.

minor comments (2)

[Abstract and bound statement] The abstract states the bound 'reveals how reducing an agent's abstract state space size improves test performance,' but the manuscript should clarify the precise dependence (e.g., which term in the bound decreases monotonically with abstraction size) to avoid ambiguity.
[Preliminaries] Notation for the abstraction function and belief-state mapping should be introduced with explicit definitions before the bound is stated, to improve readability of the POMDP extension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the POMDP extension and the successor-weighted model reduction. We address each major point below and outline the revisions we will make to strengthen the presentation and clarify the technical details.

read point-by-point responses

Referee: [POMDP extension and bound derivation] POMDP extension section (following the abstract's description of extending state abstraction frameworks): the necessity claim that smaller abstract spaces strictly improve OOD generalization rests on the performance bound decomposing cleanly into approximation and estimation errors. It is unclear whether the successor-weighted reduction, when applied to belief states rather than fully observed states, preserves the original MDP contraction or Lipschitz conditions; an additional cross-term scaling with task size could appear in the approximation error and invalidate the 'smaller is better' direction of the bound.

Authors: We appreciate the referee highlighting the need to confirm preservation of contraction and Lipschitz properties when moving from MDPs to belief states in POMDPs. In our derivation, the successor-weighted reduction is defined on the space of belief distributions, and the performance bound is obtained by composing the abstraction error with the value function under the belief-update operator. Because the successor measure is taken with respect to the POMDP transition kernel and the weighting is a convex combination over reachable beliefs, the contraction factor remains strictly less than one and independent of task horizon or size; no additional cross-term scaling with task size enters the approximation error. The 'smaller is better' direction follows directly from the fact that the approximation error is monotone in the cardinality of the abstract space. We will revise the manuscript to insert a short lemma immediately after the definition of the reduction that explicitly verifies the Lipschitz constant carries over without extra terms. revision: yes
Referee: [Successor-weighted model reduction] Definition of successor-weighted model reduction: the claim that this variant yields strictly smaller abstract spaces than prior definitions without introducing inconsistencies requires an explicit comparison (e.g., to bisimulation or homomorphism-based abstractions) showing that the weighting preserves the error decomposition used in the OOD bound. Without this, the necessity conclusion for small finite spaces does not follow.

Authors: We agree that a side-by-side comparison would make the size advantage and preservation of the error decomposition more transparent. The successor-weighted reduction merges states whose successor distributions are close under the weighting, which is a strictly weaker equivalence than bisimulation (exact matching of all future distributions) or standard homomorphism abstractions. Consequently the resulting abstract space is at most as large as those produced by the earlier notions, and the approximation error term in the OOD bound is defined directly from the weighted distance, so the same decomposition into approximation plus estimation error continues to hold. We will add a new subsection that states a formal inclusion result (the successor-weighted partition refines neither but is coarser than bisimulation) together with a short proof that the bound remains valid under the weaker equivalence. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no load-bearing reduction to inputs or self-citations

full rationale

The paper first extends existing state-abstraction techniques to POMDPs, then introduces a successor-weighted model reduction that is defined to produce strictly smaller abstract spaces, and finally derives an OOD performance bound that decomposes loss into approximation and estimation terms. No equation or definition is shown to presuppose the final necessity claim about small finite abstract spaces; the bound is presented as a consequence of the extended framework rather than a restatement of the abstraction size. Self-citations, if present, are not load-bearing for the core derivation steps. The analysis therefore remains independent of its target conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full list of assumptions, parameters, and entities cannot be extracted. The central claim rests on the existence of a suitable abstraction function and the validity of the POMDP extension.

axioms (2)

domain assumption An intelligent agent uses an abstraction function to determine which experiences can be treated as equivalent.
Stated in the abstract as the starting point for the model.
ad hoc to paper The successor-weighted model reduction can be defined and applied to POMDPs to produce smaller abstract spaces than prior methods.
Introduced as the key technical step enabling the bound.

pith-pipeline@v0.9.0 · 5737 in / 1339 out tokens · 32427 ms · 2026-05-21T07:21:35.756295+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

successor-weighted model reduction ... ∥rrr−Φ↓rrrϕ∥dddπ ≤ εϕr , ∥Φ↑PPP−Φ↓PPPϕ∥dddπ ≤ εϕp
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

constraining an agent to operate over a small, finite set of abstract states is necessary

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1]

URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116

doi: 10.1073/pnas.1903070116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116. Christoph Dann, Teodor Vanislavov Marinov, Mehryar Mohri, and Julian Zimmert. Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforce- ment Learning. InAdvances in Neural Information Processing Systems, volume 34, pages 1–12. Curran...

work page doi:10.1073/pnas.1903070116
[2]

Peter Dayan

URLhttps://proceedings.neurips.cc/paper_ files/paper/2021/hash/000c076c390a4c357313fca29e390ece-Abstract.html. Peter Dayan. Improving Generalization for Temporal Difference Learning: The Successor Representation.Neural Computation, 5(4):613–624,

work page 2021
[3]

Improving Generalization for Temporal Difference Learning : The Successor Representation

ISSN 0899-7667. doi: 10.1162/ neco.1993.5.4.613. URLhttps://doi.org/10.1162/neco.1993.5.4.613. Finale Doshi-Velez. The infinite partially observable markov decision process.Advances in neural information processing systems, 22,

work page doi:10.1162/neco.1993.5.4.613 1993
[4]

URL https://journals.plos.org/ploscompbiol/article? id=10.1371/journal.pcbi.1006116

1371/journal.pcbi.1006116. URL https://journals.plos.org/ploscompbiol/article? id=10.1371/journal.pcbi.1006116. Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model mini- mization in Markov decision processes.Artificial Intelligence, 147(1):163–223,

work page doi:10.1371/journal.pcbi.1006116
[5]

doi: 10.1016/S0004-3702(02)00376-4

ISSN 0004-3702. doi: 10.1016/S0004-3702(02)00376-4. URLhttps://www.sciencedirect.com/ science/article/pii/S0004370202003764. Yuri Grinberg and Doina Precup. On average reward policy evaluation in infinite-state partially observable systems. InArtificial Intelligence and Statistics, pages 449–457. PMLR,

work page doi:10.1016/s0004-3702(02)00376-4
[6]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B

URL https://proceedings.neurips.cc/paper_files/paper/2021/hash/ 6f5e4e86a87220e5d361ad82f1ebc335-Abstract.html. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models,

work page 2021
[7]

Scaling Laws for Neural Language Models

URLhttp://arxiv.org/abs/2001.08361. Michael Kearns and Satinder Singh. Near-Optimal Reinforcement Learning in Polynomial Time.Machine Learning, 49(2):209–232, November

work page internal anchor Pith review Pith/arXiv arXiv 2001
[8]

Cluster Computing 6(3), 215–226 (Jul 2003), https://doi.org/10.1023/A: 1023588520138

ISSN 1573-0565. doi: 10.1023/A: 1017984413808. AaronT.Kirtland, AlexanderIvanov, CameronAllen, MichaelLittman, andGeorgeKonidaris. Memory as state abstraction over history,

work page doi:10.1023/a:
[9]

Transfer with Model Features in Reinforcement Learning

URLhttps://arxiv.org/abs/1807.01736. Lucas Lehnert and Michael L. Littman. Successor Features Combine Elements of Model-Free and Model-based Reinforcement Learning.Journal of Machine Learning Research, 21(196): 1–53,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

doi: 10.1371/journal.pcbi.1008317

ISSN 1553-7358. doi: 10.1371/journal.pcbi.1008317. Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for mdps.AI&M, 1(2):3,

work page doi:10.1371/journal.pcbi.1008317
[11]

doi: https://doi.org/ 10.1016/j.artint.2022.103770

ISSN 0004-3702. doi: https://doi.org/ 10.1016/j.artint.2022.103770. URLhttps://www.sciencedirect.com/science/article/ pii/S0004370222001102. Sam Lobel and Ronald Parr. An optimal tightness bound for the simulation lemma.Rein- forcement Learning Journal, 2:785–797,

work page doi:10.1016/j.artint.2022.103770 2022
[12]

URLhttps: //dl.acm.org/doi/book/10.5555/2621980

ISBN 978-1-107-05713-5. URLhttps: //dl.acm.org/doi/book/10.5555/2621980. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy 43 Lillicrap, Madeleine ...

work page doi:10.5555/2621980
[13]

J., et al

ISSN 1476-4687. doi: 10.1038/nature16961. Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition,

work page doi:10.1038/nature16961
[14]

Sutton, Doina Precup, and Satinder Singh

ISSN 0004-3702. doi: 10.1016/S0004-3702(99)00052-1. URL https://www.sciencedirect.com/science/article/pii/S0004370299000521. V. Vapnik. Principles of Risk Minimization for Learning Theory. InAdvances in Neural Information Processing Systems, volume

work page doi:10.1016/s0004-3702(99)00052-1
[15]

Watkins, P

ISSN 1573-0565. doi: 10.1007/BF00992698. URL https://doi.org/10.1007/ BF00992698. Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. InConference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR,

work page doi:10.1007/bf00992698

[1] [1]

URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116

doi: 10.1073/pnas.1903070116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116. Christoph Dann, Teodor Vanislavov Marinov, Mehryar Mohri, and Julian Zimmert. Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforce- ment Learning. InAdvances in Neural Information Processing Systems, volume 34, pages 1–12. Curran...

work page doi:10.1073/pnas.1903070116

[2] [2]

Peter Dayan

URLhttps://proceedings.neurips.cc/paper_ files/paper/2021/hash/000c076c390a4c357313fca29e390ece-Abstract.html. Peter Dayan. Improving Generalization for Temporal Difference Learning: The Successor Representation.Neural Computation, 5(4):613–624,

work page 2021

[3] [3]

Improving Generalization for Temporal Difference Learning : The Successor Representation

ISSN 0899-7667. doi: 10.1162/ neco.1993.5.4.613. URLhttps://doi.org/10.1162/neco.1993.5.4.613. Finale Doshi-Velez. The infinite partially observable markov decision process.Advances in neural information processing systems, 22,

work page doi:10.1162/neco.1993.5.4.613 1993

[4] [4]

URL https://journals.plos.org/ploscompbiol/article? id=10.1371/journal.pcbi.1006116

1371/journal.pcbi.1006116. URL https://journals.plos.org/ploscompbiol/article? id=10.1371/journal.pcbi.1006116. Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model mini- mization in Markov decision processes.Artificial Intelligence, 147(1):163–223,

work page doi:10.1371/journal.pcbi.1006116

[5] [5]

doi: 10.1016/S0004-3702(02)00376-4

ISSN 0004-3702. doi: 10.1016/S0004-3702(02)00376-4. URLhttps://www.sciencedirect.com/ science/article/pii/S0004370202003764. Yuri Grinberg and Doina Precup. On average reward policy evaluation in infinite-state partially observable systems. InArtificial Intelligence and Statistics, pages 449–457. PMLR,

work page doi:10.1016/s0004-3702(02)00376-4

[6] [6]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B

URL https://proceedings.neurips.cc/paper_files/paper/2021/hash/ 6f5e4e86a87220e5d361ad82f1ebc335-Abstract.html. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models,

work page 2021

[7] [7]

Scaling Laws for Neural Language Models

URLhttp://arxiv.org/abs/2001.08361. Michael Kearns and Satinder Singh. Near-Optimal Reinforcement Learning in Polynomial Time.Machine Learning, 49(2):209–232, November

work page internal anchor Pith review Pith/arXiv arXiv 2001

[8] [8]

Cluster Computing 6(3), 215–226 (Jul 2003), https://doi.org/10.1023/A: 1023588520138

ISSN 1573-0565. doi: 10.1023/A: 1017984413808. AaronT.Kirtland, AlexanderIvanov, CameronAllen, MichaelLittman, andGeorgeKonidaris. Memory as state abstraction over history,

work page doi:10.1023/a:

[9] [9]

Transfer with Model Features in Reinforcement Learning

URLhttps://arxiv.org/abs/1807.01736. Lucas Lehnert and Michael L. Littman. Successor Features Combine Elements of Model-Free and Model-based Reinforcement Learning.Journal of Machine Learning Research, 21(196): 1–53,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

doi: 10.1371/journal.pcbi.1008317

ISSN 1553-7358. doi: 10.1371/journal.pcbi.1008317. Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for mdps.AI&M, 1(2):3,

work page doi:10.1371/journal.pcbi.1008317

[11] [11]

doi: https://doi.org/ 10.1016/j.artint.2022.103770

ISSN 0004-3702. doi: https://doi.org/ 10.1016/j.artint.2022.103770. URLhttps://www.sciencedirect.com/science/article/ pii/S0004370222001102. Sam Lobel and Ronald Parr. An optimal tightness bound for the simulation lemma.Rein- forcement Learning Journal, 2:785–797,

work page doi:10.1016/j.artint.2022.103770 2022

[12] [12]

URLhttps: //dl.acm.org/doi/book/10.5555/2621980

ISBN 978-1-107-05713-5. URLhttps: //dl.acm.org/doi/book/10.5555/2621980. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy 43 Lillicrap, Madeleine ...

work page doi:10.5555/2621980

[13] [13]

J., et al

ISSN 1476-4687. doi: 10.1038/nature16961. Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition,

work page doi:10.1038/nature16961

[14] [14]

Sutton, Doina Precup, and Satinder Singh

ISSN 0004-3702. doi: 10.1016/S0004-3702(99)00052-1. URL https://www.sciencedirect.com/science/article/pii/S0004370299000521. V. Vapnik. Principles of Risk Minimization for Learning Theory. InAdvances in Neural Information Processing Systems, volume

work page doi:10.1016/s0004-3702(99)00052-1

[15] [15]

Watkins, P

ISSN 1573-0565. doi: 10.1007/BF00992698. URL https://doi.org/10.1007/ BF00992698. Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. InConference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR,

work page doi:10.1007/bf00992698