TabQL: In-Context Q-Learning with Tabular Foundation Models

Aditya Balu; Ashutosh Kumar Nirala; Qisai Liu; Soumik Sarkar; Timilehin Ayanlade; Yang Li; Zhanhong Jiang

arxiv: 2605.18979 · v1 · pith:IMTTUJISnew · submitted 2026-05-18 · 💻 cs.LG

Pith reviewed 2026-05-20 12:30 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learning · Q-learning · in-context learning · tabular foundation models · DQN · sample efficiency · Bellman updates

The pith

TabQL replaces the Q-network in DQN with a tabular foundation model that uses in-context learning to achieve better sample efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TabQL as a reinforcement learning framework that substitutes the conventional parametric Q-network in Deep Q-Learning with a tabular foundation model capable of in-context learning. This allows the model to represent Q-values through sequences of state-action-Q-value tuples and adapt rapidly by conditioning on recent experience without full retraining. A warm-up phase with standard DQN bootstraps initial quality, after which new transitions are generated from TabQL actions paired with DQN Q-value predictions. Formal analysis shows TabQL interpolates between vanilla Q-learning and DQN while amortizing Bellman updates through in-context mechanisms to improve efficiency. Experiments across benchmarks demonstrate practical gains in adaptation speed and performance.

Core claim

TabQL formalizes a framework in which a sequence-to-sequence foundation model operating over tabularized state-action-Q-value tuples performs zero- or few-shot Q-value inference, thereby enabling convergence and reduced sample complexity relative to DQN by replacing repeated parametric Bellman updates with in-context conditioning on recent tuples.
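
For orientation, the two endpoints that TabQL is said to interpolate between can be written in their standard textbook forms. The third line is only a notational sketch of in-context inference: the symbols $f_\phi$, $C_t$, and $K$ are assumed notation for the foundation model, the context at step $t$, and the context size, and the exact form of the context is an assumption rather than the paper's definition.

```latex
% Vanilla (tabular) Q-learning update with step size \alpha and discount \gamma
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

% DQN: gradient descent on the squared Bellman error with target parameters \theta^-
L(\theta) = \mathbb{E}_{(s,a,r,s')} \left[
  \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right]

% In-context inference (assumed notation): no parameter update, only conditioning
\hat{Q}(s, a) = f_\phi\left(s, a \mid C_t\right),
\qquad C_t = \{(s_i, a_i, q_i)\}_{i=1}^{K}
```

In the first two lines learning happens by changing a table entry or the weights $\theta$; in the third, $\phi$ stays fixed and only the rows of $C_t$ change.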

What carries the argument

The tabular foundation model that conditions on sequences of state-action-Q-value tuples to perform in-context Q-value inference and amortize Bellman updates.

Load-bearing premise

The tabular foundation model can reliably perform zero- or few-shot Q-value inference from limited online interactions when conditioned on a tabularized representation of state-action-Q-value tuples.
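
A minimal sketch of what conditioning on tuples means operationally, assuming scalar states and a k-nearest-neighbour averager as a stand-in for the tabular foundation model. The functions `in_context_q` and `greedy_action` and the toy context are hypothetical; a real TFM such as TabPFN or TabDPT would replace the averager, and its API is not shown here.

```python
import numpy as np


def in_context_q(context, state, action, k=3):
    """Predict Q(state, action) from context rows (state, action, q).

    Stand-in only: k-nearest-neighbour averaging mimics the interface of
    'condition on rows, take no gradient step'. It is NOT a foundation model.
    """
    rows = np.asarray(context, dtype=float)
    features, targets = rows[:, :-1], rows[:, -1]
    query = np.concatenate([np.atleast_1d(state), [action]]).astype(float)
    dists = np.linalg.norm(features - query, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    return float(targets[nearest].mean())


def greedy_action(context, state, n_actions, k=3):
    """Query every action for one state and return the argmax and all estimates."""
    q_values = [in_context_q(context, state, a, k) for a in range(n_actions)]
    return int(np.argmax(q_values)), q_values


# Toy context: states 0..2, two actions, hand-written Q-values (illustrative only).
context = [
    (0, 0, 0.1), (0, 1, 0.5),
    (1, 0, 0.2), (1, 1, 0.8),
    (2, 0, 0.3), (2, 1, 1.0),
]
action, q_values = greedy_action(context, state=1, n_actions=2, k=1)
```

Adapting to new experience here means appending rows to `context`; whether a pretrained tabular model generalizes well from such rows is exactly the premise stated above.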

What would settle it

Running TabQL against DQN on standard RL benchmarks and finding no reduction in episodes or samples needed to reach target returns would falsify the efficiency improvement.
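
One way to make this test concrete is an episodes-to-target metric computed on each method's learning curve. The sketch below is an assumption about how such a comparison could be scored; the function name, the moving-average window, and the two curves are made up for illustration and are not results from the paper.

```python
def episodes_to_target(returns, target, window=3):
    """Return the 1-based episode at which the moving average of `returns`
    over `window` episodes first reaches `target`, or None if it never does."""
    for end in range(window, len(returns) + 1):
        avg = sum(returns[end - window:end]) / window
        if avg >= target:
            return end
    return None


# Hypothetical per-episode returns for two methods (illustration only).
curve_a = [0, 0, 1, 1, 1, 1]
curve_b = [0, 0, 0, 0, 1, 1, 1]

reached_a = episodes_to_target(curve_a, target=1.0)
reached_b = episodes_to_target(curve_b, target=1.0)
```

For a fair comparison, the curve for a method with a warm-up phase should include the warm-up episodes, so that their cost counts toward the total.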

Figures

Figures reproduced from arXiv: 2605.18979 by Aditya Balu, Ashutosh Kumar Nirala, Qisai Liu, Soumik Sarkar, Timilehin Ayanlade, Yang Li, Zhanhong Jiang.

**Figure 1.** TabQL: A warm-up phase initializes an informative context for Q-value inference. During online Bellman inference, actions selected by the TFM via in-context learning are executed in the environment, while corresponding Q-values are predicted using the DQN trained during warm-up. The resulting transitions are incorporated into the context, enabling continual refinement of in-context Q-value inference. Howe…

**Figure 2.** TabQL (TabPFN/TabDPT) vs. five baselines (Tabular Q, DQN, Double DQN, Dueling…

**Figure 3.** Switching point analysis: Varying warm-up length

**Figure 4.** Impact of context size K on TabQL training. Increasing K accelerates training and stabilizes the learning curve up to a saturation point, beyond which further increases yield diminishing returns. [Adjacent plot text: panel (a) "Generalization across environments" (FrozenLake, CliffWalking; y-axis "Normalized Return"; DQN vs. Tabular); a second plot with x-axis "Number of Initial Conditions Used" and y-axis "Cumulative Reward".]

**Figure 5.** (a) Performance across environments: bars show mean normalized return (scaled to [0,1])

**Figure 6.** Switching point analysis for Frozen Lake

**Figure 7.** Switching point analysis under extremely early switching in FrozenLake.

**Figure 8.** Impact of context size in Frozen Lake.

Replay Buffer and Context Construction. In TabQL, the replay buffer stores state, action, and Q-value information rather than raw transition tuples. Specifically, for each visited state, we record the estimated Q-values for all available actions. During context construction, we sample a fixed number…
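
The buffer described in this excerpt can be sketched as follows. The source text is cut off before it says what exactly is sampled, so sampling up to `k` stored states and flattening them into (state, action, Q) rows is an assumption made here, as are the class name `QValueBuffer`, the first-in-first-out eviction, and the toy values.

```python
import random


class QValueBuffer:
    """Hypothetical helper: stores, per visited state, the estimated Q-values
    of all actions instead of raw (s, a, r, s') transitions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # list of (state, [q_a0, q_a1, ...])

    def add(self, state, q_values):
        self.entries.append((state, list(q_values)))
        if len(self.entries) > self.capacity:
            self.entries.pop(0)  # assumed policy: drop the oldest entry

    def build_context(self, k, rng):
        """Sample up to k stored states (assumed sampling unit) and flatten
        them into (state, action, q) rows for an in-context model."""
        chosen = rng.sample(self.entries, min(k, len(self.entries)))
        return [(s, a, q) for s, qs in chosen for a, q in enumerate(qs)]


buffer = QValueBuffer(capacity=3)
for state, qs in [(0, [0.1, 0.5]), (1, [0.2, 0.8]), (2, [0.3, 1.0]), (3, [0.0, 0.4])]:
    buffer.add(state, qs)

context = buffer.build_context(k=2, rng=random.Random(0))
```

Each sampled state contributes one row per action, so two sampled states with two actions each give four context rows.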

read the original abstract

We propose Tabular Q-Learning (TabQL), a reinforcement learning framework that replaces the conventional parametric Q-network in Deep Q-Learning (DQN) with a tabular foundation model endowed with in-context learning capabilities. The key idea is to represent Q-values through a sequence-to-sequence foundation model operating over a tabularized representation of state-action-Q-value tuples, enabling rapid adaptation from limited online interaction by conditioning on recent experience. TabQL departs from classical DQN by leveraging (i) zero- or few-shot Q-value inference via in-context updates, and (ii) a warm-up phase using standard DQN to bootstrap high-quality context. Particularly, to enhance the context quality, new transitions are generated by executing actions output by TabQL with predicted Q values from DQN. We formalize TabQL, analyze its convergence and sample complexity under mild assumptions, and show that TabQL interpolates between vanilla Q-learning and DQN with in-context learning. Our analysis demonstrates that TabQL achieves improved efficiency compared to DQN by amortizing Bellman updates through in-context learning. Extensive numerical experiments with several benchmarks showcase the effectiveness and efficacy of the proposed TabQL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TabQL swaps the Q-network for a tabular foundation model, but the efficiency gains over DQN look unclear because of the required warm-up and mixed DQN predictions. The paper frames Q-value inference as a sequence-to-sequence task over tabular state-action-Q tuples so the model can adapt from limited online data via in-context conditioning. It adds a DQN warm-up phase to seed good context and generates new transitions by running TabQL actions while pulling Q-values from the DQN predictor to keep context quality high. The authors also claim to formalize an interpolation between vanilla Q-learning and DQN, plus convergence and sample-complexity results under mild assumptions. Benchmark experiments are reported to show practical gains. That combination of framing and empirical checks is the main thing the work contributes.

The soft spot is the efficiency argument. The method still spends samples on the DQN warm-up and continues to rely on DQN for some Q-values inside the loop. The analysis of amortized Bellman updates does not appear to subtract or bound those additive costs, so the net sample saving relative to plain DQN is not guaranteed by the stated results. If the full derivations or ablations address this directly, it would strengthen the central claim; otherwise the practical advantage stays provisional.

The math and data presentation follow standard patterns for this area with no obvious internal contradictions. This paper is aimed at RL people who want to test foundation-model style adaptation inside classical algorithms when interaction data is expensive. A reader already working on in-context learning or hybrid tabular methods would get the most out of the framing and experiments. It deserves peer review so the proofs and the exact experimental controls can be examined in detail.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TabQL, a reinforcement learning framework that replaces the parametric Q-network in DQN with a tabular foundation model using in-context learning over a sequence-to-sequence representation of state-action-Q-value tuples. It includes a DQN warm-up phase to bootstrap context, generates new transitions by executing TabQL actions while sourcing Q-values from the DQN predictor, formalizes the method, analyzes convergence and sample complexity under mild assumptions, and claims that TabQL interpolates between vanilla Q-learning and DQN while achieving improved efficiency by amortizing Bellman updates through in-context learning. Benchmark experiments are reported to demonstrate effectiveness.

Significance. If the efficiency gains can be shown to hold after rigorously bounding the hybrid overhead, the work would offer a potentially useful direction for sample-efficient RL by leveraging foundation models for rapid in-context adaptation in tabular settings. The interpolation result between Q-learning and DQN, together with the formal analysis, would be a strength if the derivations are complete and the assumptions are verified.

major comments (2)

[§4] Convergence and Sample Complexity Analysis: The claimed improvement in sample efficiency over DQN via amortized Bellman updates does not appear to subtract or bound the additive sample cost of the DQN warm-up phase or the overhead of generating new transitions with TabQL actions but DQN-sourced Q-values, as described in the method. This leaves the net amortization benefit relative to pure DQN unestablished by the stated result.
[§3] TabQL Formalization: The central efficiency claim depends on the tabular foundation model performing reliable zero- or few-shot Q-value inference from limited online interactions when conditioned on the tabularized tuples, yet the analysis provides no explicit bound or additional justification for this assumption beyond the mild assumptions stated for convergence.
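
The accounting that the first major comment asks for can be written in one line. The symbols below are hypothetical bookkeeping, not quantities defined in the manuscript: $N_{\mathrm{warm}}$ is the number of samples spent in the DQN warm-up, $N_{\mathrm{ICL}}(\varepsilon)$ the number of samples the in-context phase needs to reach accuracy $\varepsilon$, and $N_{\mathrm{DQN}}(\varepsilon)$ the number pure DQN needs for the same accuracy.

```latex
% Total sample cost of the hybrid method (assumed additive decomposition)
N_{\mathrm{TabQL}}(\varepsilon) = N_{\mathrm{warm}} + N_{\mathrm{ICL}}(\varepsilon)

% A net saving over pure DQN holds if and only if
N_{\mathrm{warm}} + N_{\mathrm{ICL}}(\varepsilon) < N_{\mathrm{DQN}}(\varepsilon)
\iff
N_{\mathrm{warm}} < N_{\mathrm{DQN}}(\varepsilon) - N_{\mathrm{ICL}}(\varepsilon)
```

Under this decomposition, a bound on $N_{\mathrm{ICL}}(\varepsilon)$ alone cannot establish the saving; the warm-up length must be smaller than the gap on the right-hand side.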

minor comments (2)

[Abstract] The abstract refers to 'mild assumptions' without listing them or pointing to their precise statement in the text; adding a short enumeration or cross-reference would improve clarity.
[§3] The description of how the tabularized representation is constructed and how the foundation model is conditioned during inference could be expanded with a concrete example or pseudocode for better reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below with clarifications and indicate planned revisions where appropriate.

read point-by-point responses

Referee: [§4] Convergence and Sample Complexity Analysis: The claimed improvement in sample efficiency over DQN via amortized Bellman updates does not appear to subtract or bound the additive sample cost of the DQN warm-up phase or the overhead of generating new transitions with TabQL actions but DQN-sourced Q-values, as described in the method. This leaves the net amortization benefit relative to pure DQN unestablished by the stated result.

Authors: We agree that the current analysis in §4 focuses on the amortization benefit during the in-context phase after the warm-up and does not explicitly subtract the fixed sample cost of the DQN warm-up or bound the hybrid transition generation overhead. This is a valid observation. In the revised manuscript we will extend the sample-complexity theorem to include the warm-up length as an additive constant term and add a remark discussing the net gain relative to pure DQN when the horizon is sufficiently long. The interpolation result between Q-learning and DQN remains unchanged.
revision: yes

Referee: [§3] TabQL Formalization: The central efficiency claim depends on the tabular foundation model performing reliable zero- or few-shot Q-value inference from limited online interactions when conditioned on the tabularized tuples, yet the analysis provides no explicit bound or additional justification for this assumption beyond the mild assumptions stated for convergence.

Authors: The convergence analysis in §3 and §4 is stated under mild assumptions that already encompass the tabular foundation model’s ability to produce accurate in-context Q-value estimates from the provided tabular tuples. We will make this assumption more explicit in the revision by adding a short paragraph that references the model’s pre-training on tabular data and the empirical reliability shown in the benchmark experiments. No new theoretical bound on inference error will be derived, as that would require additional assumptions outside the paper’s scope.
revision: partial

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper formalizes TabQL as a hybrid of in-context tabular foundation model with a DQN warm-up phase, then analyzes convergence and sample complexity under mild assumptions while showing interpolation between vanilla Q-learning and DQN. No equations or steps are exhibited that reduce the claimed efficiency gain (amortization of Bellman updates) to quantities already defined by the warm-up or by self-citation. The interpolation result is presented as a formal property of the framework rather than a definitional tautology, and no load-bearing self-citations or fitted inputs renamed as predictions appear in the provided text. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; full text not available, so ledger entries are limited to statements explicitly present in the abstract.

axioms (1)

domain assumption: Mild assumptions under which convergence and sample complexity results hold
Explicitly invoked in the abstract for the theoretical analysis of TabQL.

invented entities (1)

TabQL framework (no independent evidence)
purpose: To enable in-context Q-learning via tabular foundation models
New method introduced in the paper.

pith-pipeline@v0.9.0 · 5756 in / 1421 out tokens · 62726 ms · 2026-05-20T12:30:38.510718+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We formalize TabQL, analyze its convergence and sample complexity under mild assumptions, and show that TabQL interpolates between vanilla Q-learning and DQN with in-context learning."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "The TFM $f_\phi$ approximates the fixed point of $\hat{T}_{C_t}$, amortizing Bellman updates through in-context learning rather than gradient descent."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 3 internal anchors

[1]

Reinforcement learning: A survey.Journal of artificial intelligence research, 4:237–285, 1996

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey.Journal of artificial intelligence research, 4:237–285, 1996

work page 1996
[2]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998

work page 1998
[3]

Q-learning.Machine learning, 8(3):279–292, 1992

Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8(3):279–292, 1992

work page 1992
[4]

A theoretical analysis of deep q-learning

Jianqing Fan, Zhaoran Wang, Yuchen Xie, and Zhuoran Yang. A theoretical analysis of deep q-learning. InLearning for dynamics and control, pages 486–489. PMLR, 2020

work page 2020
[5]

Playing Atari with Deep Reinforcement Learning

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[6]

Deep reinforcement learning for robotics: A survey of real-world successes.Annual Review of Control, Robotics, and Autonomous Systems, 8(1):153–188, 2025

Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martín-Martín, and Peter Stone. Deep reinforcement learning for robotics: A survey of real-world successes.Annual Review of Control, Robotics, and Autonomous Systems, 8(1):153–188, 2025

work page 2025
[7]

Mobile robot application with hierarchical start position dqn.Computational intelligence and Neuroscience, 2022(1):4115767, 2022

Emre Erkan and Muhammet Ali Arserim. Mobile robot application with hierarchical start position dqn.Computational intelligence and Neuroscience, 2022(1):4115767, 2022

work page 2022
[8]

Hybrid attention-oriented experience replay for deep reinforcement learning and its application to a multi-robot cooperative hunting problem.Neurocomputing, 523:44–57, 2023

Lingli Yu, Shuxin Huo, Zhengjiu Wang, and Keyi Li. Hybrid attention-oriented experience replay for deep reinforcement learning and its application to a multi-robot cooperative hunting problem.Neurocomputing, 523:44–57, 2023

work page 2023
[9]

Multi-robot path planning based on a deep reinforce- ment learning dqn algorithm.CAAI Transactions on Intelligence Technology, 5(3):177–183, 2020

Yang Yang, Li Juntao, and Peng Lingling. Multi-robot path planning based on a deep reinforce- ment learning dqn algorithm.CAAI Transactions on Intelligence Technology, 5(3):177–183, 2020

work page 2020
[10]

A hybrid dqn and optimization approach for strategy and resource allocation in mec networks.IEEE Transactions on Wireless Communications, 20(7):4282–4295, 2021

Yi-Chen Wu, Thinh Quang Dinh, Yaru Fu, Che Lin, and Tony QS Quek. A hybrid dqn and optimization approach for strategy and resource allocation in mec networks.IEEE Transactions on Wireless Communications, 20(7):4282–4295, 2021

work page 2021
[11]

A comparative study of dqn and d3qn for hvac system optimization control.Energy, 307:132740, 2024

Haosen Qin, Tao Meng, Kan Chen, and Zhengwei Li. A comparative study of dqn and d3qn for hvac system optimization control.Energy, 307:132740, 2024

work page 2024
[12]

Ship energy scheduling with dqn-ce algorithm combining bi-directional lstm and attention mechanism.Applied Energy, 347:121378, 2023

Haipeng Xiao, Lijun Fu, Chengya Shang, Xianqiang Bao, Xinghua Xu, and Wenxia Guo. Ship energy scheduling with dqn-ce algorithm combining bi-directional lstm and attention mechanism.Applied Energy, 347:121378, 2023

work page 2023
[13]

Multi-objective optimization of the textile manufacturing process using deep-q-network based multi-agent reinforcement learning.Journal of Manufacturing Systems, 62:939–949, 2022

Zhenglei He, Kim Phuc Tran, Sebastien Thomassey, Xianyi Zeng, Jie Xu, and Changhai Yi. Multi-objective optimization of the textile manufacturing process using deep-q-network based multi-agent reinforcement learning.Journal of Manufacturing Systems, 62:939–949, 2022

work page 2022
[14]

Distributed real-time scheduling in cloud manufacturing by deep reinforcement learning.IEEE Transactions on Industrial Informatics, 18(12):8999–9007, 2022

Lixiang Zhang, Chen Yang, Yan Yan, and Yaoguang Hu. Distributed real-time scheduling in cloud manufacturing by deep reinforcement learning.IEEE Transactions on Industrial Informatics, 18(12):8999–9007, 2022

work page 2022
[15]

Urbanenqosplace: A deep reinforcement learning model for service placement of real-time smart city iot applications.IEEE Transactions on Services Computing, 16(4):3043–3060, 2022

Maggi Bansal, Inderveer Chana, and Siobhán Clarke. Urbanenqosplace: A deep reinforcement learning model for service placement of real-time smart city iot applications.IEEE Transactions on Services Computing, 16(4):3043–3060, 2022

work page 2022
[16]

Toward deep q-network- based resource allocation in industrial internet of things.IEEE internet of things journal, 9(12):9138–9150, 2021

Fan Liang, Wei Yu, Xing Liu, David Griffith, and Nada Golmie. Toward deep q-network- based resource allocation in industrial internet of things.IEEE internet of things journal, 9(12):9138–9150, 2021

work page 2021
[17]

Foundation models defining a new era in vision: a survey and outlook.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundation models defining a new era in vision: a survey and outlook.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 11

work page 2025
[18]

On the power of foundation models

Yang Yuan. On the power of foundation models. InInternational conference on machine learning, pages 40519–40530. PMLR, 2023

work page 2023
[19]

A survey on in-context learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 1107–1128, 2024

work page 2024
[20]

The learnability of in-context learning

Noam Wies, Yoav Levine, and Amnon Shashua. The learnability of in-context learning. Advances in Neural Information Processing Systems, 36:36637–36651, 2023

work page 2023
[21]

From small to large: In-context learning as a new paradigm for domain generalization.International Journal of Computer Vision, 134(1):9, 2026

Guanglin Zhou, Zhongyi Han, Shaoan Xie, Shiming Chen, Biwei Huang, Liming Zhu, Xin Gao, Lina Yao, and Salman Khan. From small to large: In-context learning as a new paradigm for domain generalization.International Journal of Computer Vision, 134(1):9, 2026

work page 2026
[22]

In-context learning with representations: Contextual generalization of trained transformers.Advances in Neural Information Processing Systems, 37:85867–85898, 2024

Tong Yang, Yu Huang, Yingbin Liang, and Yuejie Chi. In-context learning with representations: Contextual generalization of trained transformers.Advances in Neural Information Processing Systems, 37:85867–85898, 2024

work page 2024
[23]

What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization

Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. arXiv preprint arXiv:2305.19420, 2023

work page arXiv 2023
[24]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. Tabpfn: A transformer that solves small tabular classification problems in a second.arXiv preprint arXiv:2207.01848, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabicl: A tabular foundation model for in-context learning on large data.arXiv preprint arXiv:2502.05564, 2025

work page internal anchor Pith review arXiv 2025
[26]

TabDPT: Scaling tabular foundation models

Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Maksims V olkovs, and Anthony L Caterini. Tabdpt: Scaling tabular foundation models.arXiv preprint arXiv:2410.18164, 2024

work page arXiv 2024
[27]

Bridging automated to autonomous cyber defense: Foundational analysis of tabular q-learning

Andy Applebaum, Camron Dennler, Patrick Dwyer, Marina Moskowitz, Harold Nguyen, Nicole Nichols, Nicole Park, Paul Rachwalski, Frank Rau, Adrian Webster, et al. Bridging automated to autonomous cyber defense: Foundational analysis of tabular q-learning. InProceedings of the 15th ACM Workshop on Artificial Intelligence and Security, pages 149–159, 2022

work page 2022
[28]

Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

work page 2021
[29]

Online decision transformer

Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. Ininternational conference on machine learning, pages 27042–27059. PMLR, 2022

work page 2022
[30]

Elastic decision transformer.Advances in neural information processing systems, 36:18532–18550, 2023

Yueh-Hua Wu, Xiaolong Wang, and Masashi Hamaya. Elastic decision transformer.Advances in neural information processing systems, 36:18532–18550, 2023

work page 2023
[31]

Temporal differences-based policy iteration and applica- tions in neuro-dynamic programming.Lab

Dimitri P Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applica- tions in neuro-dynamic programming.Lab. for Info. and Decision Systems Report LIDS-P-2349, MIT, Cambridge, MA, 14:8, 1996

work page 1996
[32]

Deep reinforcement learning with double q-learning

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. InProceedings of the AAAI conference on artificial intelligence, volume 30, 2016

work page 2016
[33]

Dueling network architectures for deep reinforcement learning

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. InInternational conference on machine learning, pages 1995–2003. PMLR, 2016

work page 1995
[34]

A distributional perspective on reinforce- ment learning

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforce- ment learning. InInternational conference on machine learning, pages 449–458. PMLR, 2017. 12

work page 2017
[35]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[36]

What can transformers learn in-context? a case study of simple function classes.Advances in neural information processing systems, 35:30583–30598, 2022

Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes.Advances in neural information processing systems, 35:30583–30598, 2022

work page 2022
[37]

Transformers learn in-context by gradient descent

Johannes V on Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. InInternational Conference on Machine Learning, pages 35151–35174. PMLR, 2023

work page 2023
[38]

Trans- formers as algorithms: Generalization and stability in in-context learning

Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Trans- formers as algorithms: Generalization and stability in in-context learning. InInternational conference on machine learning, pages 19565–19594. PMLR, 2023

work page 2023
[39]

Offline reinforcement learning as one big sequence modeling problem.Advances in neural information processing systems, 34:1273– 1286, 2021

Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem.Advances in neural information processing systems, 34:1273– 1286, 2021

work page 2021
[40]

Hyper- decision transformer for efficient online policy adaptation.arXiv preprint arXiv:2304.08487, 2023

Mengdi Xu, Yuchen Lu, Yikang Shen, Shun Zhang, Ding Zhao, and Chuang Gan. Hyper- decision transformer for efficient online policy adaptation.arXiv preprint arXiv:2304.08487, 2023

work page arXiv 2023
[41]

Bellman operator convergence enhancements in reinforcement learning algorithms.arXiv preprint arXiv:2505.14564, 2025

David Krame Kadurha, Domini Jocema Leko Moutouo, and Yae Ulrich Gaba. Bellman operator convergence enhancements in reinforcement learning algorithms.arXiv preprint arXiv:2505.14564, 2025

work page arXiv 2025
[42]

Gradient free deep reinforcement learning with tabpfn.arXiv preprint arXiv:2509.11259, 2025

David Schiff, Ofir Lindenbaum, and Yonathan Efroni. Gradient free deep reinforcement learning with tabpfn.arXiv preprint arXiv:2509.11259, 2025

work page arXiv 2025
[43]

Fitted q-iteration in continuous action-space mdps.Advances in neural information processing systems, 20, 2007

András Antos, Csaba Szepesvári, and Rémi Munos. Fitted q-iteration in continuous action-space mdps.Advances in neural information processing systems, 20, 2007

work page 2007
[44]

CRC press, 2017

Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst.Reinforcement learning and dynamic programming using function approximators. CRC press, 2017

work page 2017
[45]

Probability inequalities for sums of bounded random variables.Journal of the American statistical association, 58(301):13–30, 1963

Wassily Hoeffding. Probability inequalities for sums of bounded random variables.Journal of the American statistical association, 58(301):13–30, 1963

work page 1963
[46]

Improved hoeffding’s lemma and hoeffding’s tail bounds.arXiv preprint arXiv:2012.03535, 2020

David Hertz. Improved hoeffding’s lemma and hoeffding’s tail bounds.arXiv preprint arXiv:2012.03535, 2020

work page arXiv 2012
[47]

Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path.Machine Learning, 71(1):89–129, 2008

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path.Machine Learning, 71(1):89–129, 2008

work page 2008
[48]

Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(5), 2008

Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(5), 2008

work page 2008
[49]

Gen Li, Changxiao Cai, Yuxin Chen, Yuting Wei, and Yuejie Chi. Is Q-learning minimax optimal? A tight sample complexity analysis. Operations Research, 72(1):222–236, 2024

[50]

Ali Beikmohammadi. NARS vs. reinforcement learning: ONA vs. Q-learning. arXiv preprint arXiv:2212.12517, 2022

[51]

Nimish Sanghi. The foundation: Markov decision processes. In Deep Reinforcement Learning with Python: RLHF for Chatbots and Large Language Models, pages 43–87. Springer, 2024

[52]

Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Tobias Pfaff, Theophane Weber, Lars Buesing, and Peter W Battaglia. Combining Q-learning and search with amortized value estimates. arXiv preprint arXiv:1912.02807, 2019

[53]

Brandon Amos et al. Tutorial on amortized optimization. Foundations and Trends® in Machine Learning, 16(5):592–732, 2023

[54]

Matthew Landers, Taylor W Killian, Hugo Barnes, Thomas Hartvigsen, and Afsaneh Doryab. Brave: Offline reinforcement learning for discrete combinatorial action spaces. arXiv preprint arXiv:2410.21151, 2024

A Appendix

In this section, we present additional related work and analysis, technical discussion, proofs of the main results, and additional experim...

