Generalisation in Multitask Fitted Q-Iteration and Offline Q-learning

Kausthubh Manda; Raghuram Bharadwaj Diddigi

arxiv: 2512.20220 · v2 · submitted 2025-12-23 · 💻 cs.LG


Pith reviewed 2026-05-16 20:16 UTC · model grok-4.3


keywords: multitask reinforcement learning · offline Q-learning · fitted Q-iteration · generalization bounds · low-rank representations · value function estimation · finite-sample analysis · Bellman error minimization


The pith

Multitask fitted Q-iteration learns shared low-rank action-value representations to achieve 1/sqrt(nT) generalization rates from offline data across tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies offline reinforcement learning with multiple related tasks that share a low-rank structure in their action-value functions. It introduces a multitask variant of fitted Q-iteration that jointly optimizes a shared representation and task-specific value functions by minimizing Bellman errors on fixed datasets from each task. Under standard realizability and coverage assumptions, the analysis proves finite-sample bounds showing that the estimation error scales as 1 over the square root of the total number of samples pooled across all tasks. The work also examines a downstream setting where a new task reuses the learned representation, which reduces the sample complexity for value estimation relative to learning independently. A sympathetic reader cares because the results quantify when shared structure can improve statistical efficiency in model-free offline Q-learning without requiring additional online interaction.

Core claim

Under standard realizability and coverage assumptions commonly used in offline reinforcement learning, multitask fitted Q-iteration jointly learns a shared low-rank representation of the action-value functions together with task-specific value functions via Bellman error minimization on offline data, yielding finite-sample generalization guarantees with error scaling as 1/sqrt(nT) on the total samples across tasks while retaining the usual dependence on the horizon and concentrability coefficients.

What carries the argument

Multitask fitted Q-iteration that jointly learns a shared low-rank representation and task-specific value functions via Bellman error minimization on offline data.
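A rough sketch of how such a backward pass could look with linear function classes is given here. It is not the paper's estimator: the joint Bellman-error minimization is replaced by a simpler stand-in (per-task least squares followed by a rank-`rank` truncated SVD across tasks), and the function name `multitask_fqi`, the raw features, and the dataset layout are all hypothetical.

```python
import numpy as np

def multitask_fqi(datasets, horizon, rank):
    """Backward fitted Q-iteration with a shared low-rank linear map.

    Hypothetical data layout: datasets[t][h] = (X, R, X_next) with
      X:      (n, p)    raw features of the sampled (s, a) at step h
      R:      (n,)      observed rewards
      X_next: (n, A, p) raw features of (s', a') for every action a'
    Returns lists B, W with B[h] of shape (p, rank) (shared map) and
    W[h] of shape (rank, T) (one head per task).
    """
    T = len(datasets)
    p = datasets[0][0][0].shape[1]
    theta_next = np.zeros((p, T))          # Q at step `horizon` is zero
    B, W = [None] * horizon, [None] * horizon
    for h in reversed(range(horizon)):
        theta = np.zeros((p, T))
        for t in range(T):
            X, R, X_next = datasets[t][h]
            # Bellman target: reward plus greedy value at the next step.
            q_next = X_next @ theta_next[:, t]      # (n, A)
            y = R + q_next.max(axis=1)
            theta[:, t] = np.linalg.lstsq(X, y, rcond=None)[0]
        # Stand-in for the joint fit: project the stacked per-task
        # solutions onto their best rank-`rank` approximation.
        U, s, Vt = np.linalg.svd(theta, full_matrices=False)
        B[h] = U[:, :rank]
        W[h] = s[:rank, None] * Vt[:rank]
        theta_next = B[h] @ W[h]
    return B, W
```

The truncation step is where information is shared across tasks: every task's regression solution contributes to the estimate of the common subspace `B[h]`, while each column of `W[h]` remains task-specific.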

If this is right

Pooling samples across tasks shrinks the estimation error at rate 1/sqrt(nT), where nT is the total sample count across tasks.
The usual horizon length and concentrability factors from distribution shift remain unchanged in the bounds.
Reusing the shared representation for a new downstream task lowers the effective complexity of its value estimation compared to learning from scratch.
The guarantees apply specifically to value-based methods under the offline multitask regime with fixed datasets.
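The downstream claim can be illustrated with a toy regression analogue, sketched under strong assumptions: a single regression step stands in for one fitted-Q backup, the representation `B` is taken as exactly correct and frozen (upstream estimation error is ignored), and all names and default values are hypothetical.

```python
import numpy as np

def fit_head(Phi, y):
    """Least-squares task head on frozen d-dimensional features Phi (n, d)."""
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

def downstream_errors(n=100, p=50, d=3, noise=0.5, seed=0):
    """Compare reuse of a frozen representation with learning from scratch.

    Returns (err_reuse, err_scratch): test mean-squared errors of a head
    fitted on d frozen features versus a full p-dimensional fit.
    """
    rng = np.random.default_rng(seed)
    # Orthonormal (p, d) matrix standing in for the learned representation.
    B = np.linalg.qr(rng.normal(size=(p, d)))[0]
    w = rng.normal(size=d)                       # new task's true head
    X = rng.normal(size=(n, p))
    y = X @ B @ w + noise * rng.normal(size=n)   # noisy regression targets
    X_test = rng.normal(size=(2000, p))
    y_test = X_test @ B @ w
    w_hat = fit_head(X @ B, y)                           # d parameters
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]     # p parameters
    err_reuse = np.mean((X_test @ B @ w_hat - y_test) ** 2)
    err_scratch = np.mean((X_test @ theta_hat - y_test) ** 2)
    return err_reuse, err_scratch
```

With these defaults the reuse fit estimates 3 parameters and the scratch fit estimates 50 from the same 100 samples, which is the sense in which a shared representation lowers the effective complexity of the downstream problem.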

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could extend to other offline algorithms that minimize Bellman errors if they incorporate similar joint representation learning.
In practice this points to collecting data from related tasks first to exploit shared structure before tackling new ones.
It raises the question of how to verify or learn the low-rank structure from data when it is not known in advance.

Load-bearing premise

The tasks share a low-rank representation of their action-value functions, together with the standard realizability and coverage assumptions that allow the Bellman error minimization to produce the stated rates.

What would settle it

Observe whether the learned value functions' generalization error decreases as 1/sqrt(total samples) when the tasks are known to share the low-rank structure; failure to observe this scaling under the stated assumptions would falsify the claim.
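One way such a check might be set up is sketched here on a simplified regression analogue rather than a full offline RL problem. The setup is an assumption throughout: tasks share an orthonormal map `B`, the estimator is per-task least squares followed by a truncated SVD (a stand-in, not the paper's joint minimizer), only n is varied with T held fixed, and the function names are hypothetical.

```python
import numpy as np

def subspace_error(n, T, p=20, d=2, noise=1.0, seed=0):
    """Frobenius distance between the true shared subspace and its estimate."""
    rng = np.random.default_rng(seed)
    B = np.linalg.qr(rng.normal(size=(p, d)))[0]   # true shared map (p, d)
    W = rng.normal(size=(d, T))                    # task-specific heads
    theta_hat = np.empty((p, T))
    for t in range(T):
        X = rng.normal(size=(n, p))
        y = X @ B @ W[:, t] + noise * rng.normal(size=n)
        theta_hat[:, t] = np.linalg.lstsq(X, y, rcond=None)[0]
    # Top-d left singular vectors estimate the shared subspace.
    B_hat = np.linalg.svd(theta_hat, full_matrices=False)[0][:, :d]
    # Part of B that lies outside the estimated subspace.
    return np.linalg.norm(B - B_hat @ (B_hat.T @ B))

def fitted_slope(ns=(100, 400, 1600, 6400), T=10, trials=5):
    """Log-log slope of the averaged error against total samples n*T."""
    errs = [np.mean([subspace_error(n, T, seed=s) for s in range(trials)])
            for n in ns]
    total = np.array(ns) * T
    slope = np.polyfit(np.log(total), np.log(errs), 1)[0]
    return slope, errs
```

A slope near -0.5 would be consistent with a 1/sqrt(nT) rate in this toy setting. A slope far from that value would indicate a different rate for this estimator, though it would say nothing definitive about the paper's theorem, which concerns the Bellman-error minimizer under its own assumptions.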

Figures

Figures reproduced from arXiv: 2512.20220 by Kausthubh Manda, Raghuram Bharadwaj Diddigi.

**Figure 2.** n-Scaling: Proof of Sample Consistency.

**Figure 3.** H-Scaling: Proof of Recursive Error Propagation.

read the original abstract

We study offline multitask reinforcement learning in settings where multiple tasks share a low-rank representation of their action-value functions. In this regime, a learner is provided with fixed datasets collected from several related tasks, without access to further online interaction, and seeks to exploit shared structure to improve statistical efficiency and generalization. We analyze a multitask variant of fitted Q-iteration that jointly learns a shared representation and task-specific value functions via Bellman error minimization on offline data. Under standard realizability and coverage assumptions commonly used in offline reinforcement learning, we establish finite-sample generalization guarantees for the learned value functions. Our analysis explicitly characterizes how pooling data across tasks improves estimation accuracy, yielding a $1/\sqrt{nT}$ dependence on the total number of samples across tasks, while retaining the usual dependence on the horizon and concentrability coefficients arising from distribution shift. In addition, we consider a downstream offline setting in which a new task shares the same underlying representation as the upstream tasks. We study how reusing the representation learned during the multitask phase affects value estimation for this new task, and show that it can reduce the effective complexity of downstream learning relative to learning from scratch. Together, our results clarify the role of shared representations in multitask offline Q-learning and provide theoretical insight into when and how multitask structure can improve generalization in model-free, value-based reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives 1/sqrt(nT) finite-sample bounds for multitask offline fitted Q-iteration under low-rank shared Q-function structure, plus a downstream reuse result.


The main thing to know is that this work extends single-task offline RL analysis to the multitask case by assuming tasks share a low-rank representation of their action-value functions. They analyze a joint fitted Q-iteration that pools the offline datasets and minimizes Bellman error across tasks, then show how this yields a 1/sqrt(nT) rate on total samples while keeping the usual horizon and concentrability factors. They also bound the benefit of freezing the learned representation for a new downstream task.
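For orientation, the pooled regression step described in this letter can be written schematically as follows. This is a generic fitted-Q objective under the shared linear structure, not a formula copied from the paper: the representation class $\Phi$, the per-task sample index $i$, and the assumption of $n$ samples per task are notational choices made here.

```latex
\[
\bigl(\hat\phi_h,\ \hat\omega^{(1)}_h, \dots, \hat\omega^{(T)}_h\bigr)
\in \arg\min_{\phi \in \Phi,\ \omega^{(t)} \in \mathbb{R}^d}
\sum_{t=1}^{T} \sum_{i=1}^{n}
\Bigl( \bigl\langle \phi(s^{t}_{i}, a^{t}_{i}),\, \omega^{(t)} \bigr\rangle
 - r^{t}_{i} - \max_{a'} \hat Q^{(t)}_{h+1}(s'^{\,t}_{i}, a') \Bigr)^{2},
\qquad
\hat Q^{(t)}_{h}(s,a) = \bigl\langle \hat\phi_h(s,a),\, \hat\omega^{(t)}_h \bigr\rangle .
\]
```

The outer sum runs over all $nT$ transitions while $\phi$ is common to every task, which is the structural reason a rate in the total sample count is plausible for the shared component.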

Referee Report

2 major / 2 minor

Summary. The paper studies offline multitask RL where tasks share a low-rank representation of their action-value functions. It analyzes a multitask fitted Q-iteration procedure that jointly minimizes Bellman error over pooled offline datasets from T tasks, establishes finite-sample generalization bounds for the learned value functions under standard realizability and coverage assumptions, and derives a 1/√(nT) rate in total samples nT. It further analyzes transfer to a new downstream task that reuses the learned representation, showing reduced effective complexity relative to learning from scratch.

Significance. If the central claims hold, the work supplies the first explicit finite-sample analysis of representation learning in offline multitask Q-learning, clarifying how pooling across tasks yields a total-sample rate without extra factors in T or rank while preserving standard concentrability dependence. The transfer result gives a concrete mechanism by which upstream multitask training can improve downstream sample efficiency, which is directly relevant to practical offline RL pipelines that exploit shared structure.

major comments (2)

[§4, Theorem 1] §4, Theorem 1 (and its proof): the 1/√(nT) rate is stated to follow from the low-rank shared structure reducing effective function-class complexity, yet the argument does not explicitly bound the error incurred when the joint optimization recovers the shared representation matrix; without this intermediate bound it is unclear whether the claimed rate survives the usual concentrability factors.
[§5] §5, transfer bound: the downstream guarantee assumes the upstream representation is frozen exactly, but the finite-sample error in the learned representation (which scales with 1/√(nT)) is not propagated into the new-task concentrability coefficient; this omission makes the claimed reduction in effective complexity non-quantitative.

minor comments (2)

Notation for the concentrability coefficients is introduced only in the appendix; a brief definition in the main text would improve readability.
[Abstract] The abstract claims the bounds retain the “usual dependence on the horizon,” but the precise horizon factor (H or H²) is not restated in the theorem statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major comment below and indicate the revisions we plan to make to strengthen the presentation.


Referee: [§4, Theorem 1] §4, Theorem 1 (and its proof): the 1/√(nT) rate is stated to follow from the low-rank shared structure reducing effective function-class complexity, yet the argument does not explicitly bound the error incurred when the joint optimization recovers the shared representation matrix; without this intermediate bound it is unclear whether the claimed rate survives the usual concentrability factors.

Authors: We appreciate the referee's observation. The proof of Theorem 1 derives the 1/√(nT) rate by applying uniform convergence over the joint low-rank function class (shared representation plus task-specific heads) under the given realizability and coverage assumptions. The joint minimization controls the representation recovery error implicitly through the excess Bellman risk bound, without introducing extra factors in T or rank beyond standard concentrability. To make this step fully explicit, we will insert an intermediate lemma bounding the Frobenius-norm error of the recovered representation matrix at rate O(1/√(nT)) (scaled by concentrability) and propagate it directly into the value-function bound. This will be added to the revised proof of Theorem 1. revision: yes
Referee: [§5] §5, transfer bound: the downstream guarantee assumes the upstream representation is frozen exactly, but the finite-sample error in the learned representation (which scales with 1/√(nT)) is not propagated into the new-task concentrability coefficient; this omission makes the claimed reduction in effective complexity non-quantitative.

Authors: We agree that a fully quantitative transfer result requires propagating the upstream representation error. The current analysis in §5 freezes the representation to isolate the complexity reduction for the downstream task. We will revise the transfer theorem to include an additive perturbation term in the downstream concentrability coefficient that scales with the upstream 1/√(nT) error. Under a mild condition that this error is sufficiently small relative to the downstream sample size, the effective complexity reduction (by a factor depending on rank) remains intact, yielding an improved downstream rate. This extension will be incorporated into the revised Section 5. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under external assumptions


The paper derives finite-sample generalization bounds for multitask fitted Q-iteration by jointly minimizing Bellman error over pooled offline datasets from tasks sharing a low-rank representation of action-value functions. The 1/√(nT) rate follows from the reduced effective complexity of the shared representation under standard realizability and concentrability assumptions drawn from the offline RL literature. No load-bearing step reduces by construction to fitted parameters inside the paper, nor relies on self-citation chains or ansatzes smuggled from prior author work; the central claims remain independent of the target results and are supported by external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two standard offline RL assumptions plus the modeling choice of low-rank shared representation; no free parameters or invented entities are introduced in the abstract.

axioms (2)

Domain assumption (realizability): the true action-value functions lie in the function class used by the learner. Invoked to obtain the generalization guarantees under Bellman error minimization.
Domain assumption (coverage): the offline data distribution sufficiently covers the state-action space. Required to control distribution shift and obtain the stated rates involving concentrability coefficients.
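In symbols, these two assumptions are often stated along the following lines. This is one common textbook form, given as an assumption about notation; the paper's own definitions (its concentrability coefficients are introduced in the appendix, per the referee report) may differ in detail. Here $\mu^{t}_h$ denotes the data distribution for task $t$ at step $h$, and $\nu$ ranges over state-action distributions induced by admissible policies.

```latex
\[
\text{Realizability:}\quad
Q^{(*,t)}_h \in \mathcal{F}_h
 := \bigl\{ (s,a) \mapsto \langle \phi(s,a), \omega \rangle
 : \phi \in \Phi,\ \omega \in \mathbb{R}^d \bigr\}
\quad \text{for all } t, h.
\]
\[
\text{Coverage (concentrability):}\quad
C := \sup_{t,\,h,\,\nu}\ \Bigl\| \frac{\nu}{\mu^{t}_h} \Bigr\|_{\infty} < \infty .
\]
```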


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We assume that the tasks share a latent representation of the state space: there exists a feature encoder $\phi^*: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ such that the optimal action-value functions satisfy $Q^{(*,t)}_h(s,a) = \langle \phi^*(s,a), \omega^{(*,t)}_h \rangle$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: yielding a $1/\sqrt{nT}$ dependence on the total number of samples across tasks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

