Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates

Aritra Mitra; Sreejeet Maity

arxiv: 2509.08933 · v2 · pith:JYNFBQ5Wnew · submitted 2025-09-10 · 💻 cs.LG · cs.SY· eess.SY· math.OC

Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates

Sreejeet Maity , Aritra Mitra This is my paper

Pith reviewed 2026-05-22 12:55 UTC · model grok-4.3

classification 💻 cs.LG cs.SYeess.SYmath.OC

keywords Q-learningreinforcement learningadversarial corruptionasynchronous samplingfinite-time analysisrobust RLalmost-martingales

0 comments

The pith

Despite adversarial reward corruption, a robust asynchronous Q-learning algorithm achieves finite-time convergence rates matching standard bounds up to an additive term that scales with the corruption fraction, and these rates are shown to

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a robust variant of Q-learning designed to learn optimal policies in discounted infinite-horizon reinforcement learning even when rewards are corrupted by an adversary. Under the asynchronous sampling model with time-correlated data, it establishes that the finite-time error bounds match those of the uncorrupted setting except for an extra term linear in the fraction of corrupted samples. The method requires no knowledge of the reward distribution and supplies the first such finite-time robustness results for asynchronous Q-learning. The proof relies on a refined concentration inequality that handles almost-martingales induced by the sampling process.

Core claim

A novel robust Q-learning algorithm attains finite-time convergence to the optimal action-value function at rates that remain near-optimal under adversarial reward corruptions, matching existing asynchronous Q-learning bounds up to an additive term proportional to the corruption fraction; an information-theoretic lower bound confirms that no algorithm can do substantially better, and the guarantees hold without assumptions on the reward distribution.

What carries the argument

A refined Azuma-Hoeffding inequality for almost-martingales that controls concentration of the time-correlated, partially corrupted samples arising in asynchronous Q-learning.

If this is right

The algorithm remains agnostic to the reward distribution while still delivering the stated guarantees.
The same finite-time robustness extends to the asynchronous sampling model with temporal correlations.
Matching lower bounds establish that the derived rates cannot be improved by more than constants.
The refined concentration tool may be reusable for other RL algorithms that face similar sampling and corruption issues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar robustification techniques could be applied to other value-based methods such as SARSA or fitted Q-iteration.
In deployed systems the results suggest that modest corruption fractions incur only linear rather than catastrophic degradation.
The approach may inform the design of RL algorithms that must operate with noisy or manipulated feedback from human users or sensors.

Load-bearing premise

A fixed but unknown fraction of the observed rewards may be arbitrarily corrupted by an adversary, and the almost-martingale structure induced by asynchronous sampling must satisfy the refined concentration inequality used in the analysis.

What would settle it

An experiment that measures the sample complexity needed to reach a given accuracy level while corrupting a known fraction epsilon of rewards and checks whether the observed complexity exceeds the standard bound by more than a constant times epsilon.

Figures

Figures reproduced from arXiv: 2509.08933 by Aritra Mitra, Sreejeet Maity.

**Figure 1.** Figure 1: (Top Left) Plots of the ℓ∞ error Et = ∥Qt−Q∗∥∞ for Vanilla-Q without corruption, under varying reward variances σ 2 ∈ {1, 5, 15}. (Top Right) Plots of the ℓ∞ error Et for Vanilla-Q under the Huber-contaminated reward model in Eq. (3) with corruption probability ε = {0.001, 0.005, 0.01}, variance σ 2 = 1, and a biasing attack where the attack signal is −104 . (Bottom Left and Bottom Right): Analogous plots … view at source ↗

**Figure 2.** Figure 2: (Top Left) Plots of the ℓ∞ error Et = ∥Qt − Q∗∥∞ for Robust Async-Q, with corruption fractions ε ∈ {0.001, 0.005, 0.01} and reward variance σ 2 = 1, under the biasing attack of magnitude −104 . (Top Right) Plots of the ℓ∞ error for Robust Async-Q, with corruption fractions ε ∈ {0.001, 0.005, 0.01} and reward variance σ 2 = 5, under a biasing attack of −104 . (Bottom Left and Bottom Right) Analogous plots f… view at source ↗

**Figure 3.** Figure 3: (Left) Plots of the ℓ∞ error Et = ∥Qt − Q∗∥∞ for Robust Async-RAQ, with corruption fraction ε = 0.001 and reward variance σ 2 = 1, under the biasing attack of magnitude −104 . (Right) Plots of the ℓ∞ error for Robust Async-RAQ, with corruption fraction ε = 0.001 and reward variance σ 2 = 5, under the same biasing attack. The reported plots are averaged over 100 independent runs. Discussion on Experiment 3:… view at source ↗

**Figure 4.** Figure 4: Plots of the ℓ∞ error Et = ∥Qt − Q∗∥∞ for Robust Async-RAQ-M, with corruption fraction ε ∈ {0.001, 0.005, 0.01} and reward variance σ 2 = 1, under the biasing attack of magnitude −104 . The reported plots are averaged over 50 independent runs. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_4.png] view at source ↗

read the original abstract

We study the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the $Q$-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous $Q$-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives the first finite-time near-optimal robustness guarantees for asynchronous Q-learning under adversarial reward corruption.

read the letter

This paper's main result is the first set of finite-time guarantees for a robust asynchronous Q-learning algorithm under reward corruption. The rates match the uncorrupted case up to an additive term that grows with the corruption fraction, and they come with a matching information-theoretic lower bound. The algorithm itself is straightforward and does not require knowledge of the reward distribution, which is a practical plus. They handle the asynchronous setting with its correlated samples by developing a refined version of the Azuma-Hoeffding inequality for almost-martingales. I like that they position this inequality as potentially useful beyond this paper. The overall structure of the proof seems standard for this area, with the novelty in adapting it to corruption. Where it could be weaker is in verifying that the almost-martingale property survives the adversarial corruptions on top of the existing temporal dependencies. If the corruptions introduce larger deviations than the refinement accounts for, the additive term in the bound might not hold as stated. Checking the exact conditions in the proof would clarify this. For readers working on robust RL or on martingale methods in learning theory, this should be worth a look. It is focused enough and has verifiable claims, so it deserves to go to peer review rather than a quick reject.

Referee Report

1 major / 2 minor

Summary. The paper develops a robust variant of asynchronous Q-learning for discounted infinite-horizon MDPs subject to adversarial reward corruptions affecting a bounded fraction of samples. It claims finite-time convergence guarantees that recover the best known rates for the uncorrupted asynchronous case up to an additive term linear in the corruption fraction, establishes a matching information-theoretic lower bound, and introduces a refined Azuma-Hoeffding inequality for almost-martingales as the key technical tool. The algorithm is distribution-agnostic and provides the first such finite-time robustness results under asynchronous sampling.

Significance. If the central bounds hold, the work supplies the first finite-time robustness guarantees for asynchronous Q-learning, a setting that is practically important because real-world data collection is often asynchronous and time-correlated. The information-theoretic lower bound and the distribution-agnostic property strengthen the contribution. The refined concentration inequality for almost-martingales is presented as potentially reusable and is therefore a positive technical feature.

major comments (1)

[Section 4] Section 4 (main analysis) and the statement of the refined Azuma-Hoeffding inequality: the proof must explicitly verify that the almost-martingale property and bounded-difference conditions continue to hold when the asynchronous sampling process (which already induces temporal correlations) is combined with adversarial corruptions on a fraction of the observed rewards. The additive corruption term in the sample-complexity bound rests on this joint applicability; without a self-contained argument showing that corruptions do not introduce uncontrolled deviations outside the almost-martingale framework, the claimed near-optimality is not yet justified.

minor comments (2)

[Abstract] The abstract and introduction could state the precise dependence of the additive term on the corruption fraction (e.g., whether it is linear in the fraction or involves additional logarithmic factors) to make the near-optimality claim immediately quantifiable.
[Section 2] Notation for the corruption model (e.g., the indicator for corrupted samples and the bound on the corruption magnitude) should be introduced once in a dedicated preliminary subsection rather than inline in the analysis.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The feedback on the technical details of the analysis in Section 4 is appreciated, and we address the major comment below.

read point-by-point responses

Referee: [Section 4] Section 4 (main analysis) and the statement of the refined Azuma-Hoeffding inequality: the proof must explicitly verify that the almost-martingale property and bounded-difference conditions continue to hold when the asynchronous sampling process (which already induces temporal correlations) is combined with adversarial corruptions on a fraction of the observed rewards. The additive corruption term in the sample-complexity bound rests on this joint applicability; without a self-contained argument showing that corruptions do not introduce uncontrolled deviations outside the almost-martingale framework, the claimed near-optimality is not yet justified.

Authors: We thank the referee for pointing out the need for greater explicitness on this technical point. In the current manuscript, the refined Azuma-Hoeffding inequality is applied to the almost-martingale difference sequence generated by the stochastic approximation errors under asynchronous sampling; the adversarial corruptions are isolated as a separate additive perturbation whose total contribution is bounded by the corruption fraction times the reward range. Because the corruptions affect only a bounded fraction of samples and do not alter the conditional-expectation structure of the clean samples with respect to the filtration induced by the asynchronous process, the almost-martingale and bounded-difference properties are preserved for the noise term. Nevertheless, we agree that a self-contained verification of this joint applicability would strengthen the presentation. In the revised version we will insert a new supporting lemma immediately preceding the main convergence argument in Section 4. The lemma will (i) recall the definition of the almost-martingale under asynchronous sampling, (ii) show that the corruption indicator process contributes only a deterministic bias that is absorbed into the additive corruption term of the final bound, and (iii) verify that the bounded-difference condition continues to hold uniformly over the combined process. This addition will make the justification of the near-optimal rate fully explicit without changing any of the stated theorems. revision: yes

Circularity Check

0 steps flagged

Derivation relies on external concentration inequality and independent lower bound; no reduction to fitted parameters or self-referential definitions.

full rationale

The paper introduces a robust Q-learning variant and derives finite-time bounds by applying a refined Azuma-Hoeffding inequality to almost-martingales under asynchronous sampling and limited adversarial corruption. It separately establishes an information-theoretic lower bound to show near-optimality. No step equates a claimed prediction or rate to a quantity defined by the algorithm's own fitted parameters, nor does any central claim reduce to a self-citation chain that itself assumes the target result. The analysis is presented as self-contained against standard RL benchmarks, with the inequality invoked as a technical tool rather than derived from the paper's outputs. This yields a normal non-finding of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is based only on the abstract, so the ledger reflects standard assumptions inferred from the described RL setting and corruption model; full paper would be needed to list all invoked lemmas or background results.

axioms (2)

domain assumption Discounted infinite-horizon MDP with asynchronous sampling and time-correlated observations
The problem setting stated in the abstract.
domain assumption Adversarial reward corruption affecting at most a known fraction of samples
The corruption model under which the robustness guarantees are claimed.

pith-pipeline@v0.9.0 · 5686 in / 1354 out tokens · 40763 ms · 2026-05-22T12:55:05.375656+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales... (Theorem 5, Appendix G)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

finite-time convergence rate... ˜O(1/√T) up to additive O(√ε) term (Theorems 2,4,6)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 7 internal anchors

[1]

A review of reinforcement learning in financial applications.Annual Review of Statistics and Its Application, 12(1):209–232, 2025

Yahui Bai, Yuhe Gao, Runzhe Wan, Sheng Zhang, and Rui Song. A review of reinforcement learning in financial applications.Annual Review of Statistics and Its Application, 12(1):209–232, 2025

work page 2025
[2]

Reinforcement learning in personalized medicine: A comprehensive review of treatment optimization strategies.Cureus, 17(4), 2025

K Banumathi, Latha Venkatesan, Lizy Sonia Benjamin, K Vijayalakshmi, Nesa Sathya Satchi, and Nesa Sathya Satchi IV. Reinforcement learning in personalized medicine: A comprehensive review of treatment optimization strategies.Cureus, 17(4), 2025

work page 2025
[3]

Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022

M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022

work page 2022
[4]

Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving

Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving.arXiv preprint arXiv:1610.03295, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Reinforcement learning in robotics: A survey

Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013

work page 2013
[6]

End-to-end training of deep visuomotor policies.The Journal of Machine Learning Research, 17(1):1334–1373, 2016

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.The Journal of Machine Learning Research, 17(1):1334–1373, 2016

work page 2016
[7]

Exploration-driven policy optimization in rlhf: Theoretical insights on efficient data utilization.arXiv preprint arXiv:2402.10342, 2024

Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, and R Srikant. Exploration-driven policy optimization in rlhf: Theoretical insights on efficient data utilization.arXiv preprint arXiv:2402.10342, 2024

work page arXiv 2024
[8]

Corruption robust offline reinforcement learning with human feedback.arXiv preprint arXiv:2402.06734, 2024

Debmalya Mandal, Andi Nika, Parameswaran Kamalaruban, Adish Singla, and Goran Radanović. Corruption robust offline reinforcement learning with human feedback.arXiv preprint arXiv:2402.06734, 2024

work page arXiv 2024
[9]

Provably robust dpo: Aligning language models with noisy feed- back,

Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natarajan. Provably robust dpo: Aligning language models with noisy feedback.arXiv preprint arXiv:2403.00409, 2024

work page arXiv 2024
[10]

Learning product rankings robust to fake users

Negin Golrezaei, Vahideh Manshadi, Jon Schneider, and Shreyas Sekar. Learning product rankings robust to fake users. InProceedings of the 22nd ACM Conference on Economics and Computation, pages 560–561, 2021

work page 2021
[11]

Identifying and treating outliers in finance.Financial Management, 48(2):345–384, 2019

John Adams, Darren Hayunga, Sattar Mansi, David Reeb, and Vincenzo Verardi. Identifying and treating outliers in finance.Financial Management, 48(2):345–384, 2019. 19

work page 2019
[12]

A systems and control perspective of CPS security.Annual reviews in control, 47:394–411, 2019

Seyed Mehran Dibaji, Mohammad Pirani, David Bezalel Flamholz, Anuradha M Annaswamy, Karl Henrik Johansson, and Aranya Chakrabortty. A systems and control perspective of CPS security.Annual reviews in control, 47:394–411, 2019

work page 2019
[13]

Q-learning.Machine learning, 8:279–292, 1992

Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8:279–292, 1992

work page 1992
[14]

Robust estimation of a location parameter

Peter J Huber. Robust estimation of a location parameter. InBreakthroughs in statistics, pages 492–518. Springer, 1992

work page 1992
[15]

John Wiley & Sons, 2004

Peter J Huber.Robust statistics, volume 523. John Wiley & Sons, 2004

work page 2004
[16]

Adversarial Attacks on Stochastic Bandits

Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Xiaojin Zhu. Adversarial attacks on stochastic bandits.arXiv preprint arXiv:1810.12188, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[17]

Stochastic bandits robust to adversarial corruptions

Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. InProceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122, 2018

work page 2018
[18]

Data poisoning attacks on stochastic bandits

Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. InInternational Conference on Machine Learning, pages 4042–4050. PMLR, 2019

work page 2019
[19]

Better algorithms for stochastic bandits with adversarial corruptions

Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. InConference on Learning Theory, pages 1562–1578. PMLR, 2019

work page 2019
[20]

Corruption-tolerant bandit learning.Machine Learning, 108(4):687–715, 2019

Sayash Kapoor, Kumar Kshitij Patel, and Purushottam Kar. Corruption-tolerant bandit learning.Machine Learning, 108(4):687–715, 2019

work page 2019
[21]

Crimed: Lower and upper bounds on regret for bandits with unbounded stochastic corruption

Shubhada Agrawal, Timothée Mathieu, Debabrota Basu, and Odalric-Ambrym Maillard. Crimed: Lower and upper bounds on regret for bandits with unbounded stochastic corruption. In International Conference on Algorithmic Learning Theory, pages 74–124. PMLR, 2024

work page 2024
[22]

Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

work page 1988
[23]

Robust multivariate mean estimation: the optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021

Gabor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021

work page 2021
[24]

On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence

Nathaniel Korda and Prashanth La. On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. InInternational conference on machine learning, pages 626–634. PMLR, 2015

work page 2015
[25]

Finite sample analyses for TD (0) with function approximation

Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for TD (0) with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018
[26]

state-of-the-art

C Narayanan and Csaba Szepesvári. Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results. Technical report, Technical report, 2017

work page 2017
[27]

Linear Stochastic Approximation: Constant Step-Size and Iterate Averaging

Chandrashekar Lakshminarayanan and Csaba Szepesvári. Linear stochastic approximation: Constant step-size and iterate averaging.arXiv preprint arXiv:1709.04073, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Cambridge university press, 2019

Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. 20

work page 2019
[29]

Finite-time analysis of asynchronous stochastic approximation and Q-learning.Proceedings of Machine Learning Research, 125:1–21, 2020

Adam Wierman Guannan Qu. Finite-time analysis of asynchronous stochastic approximation and Q-learning.Proceedings of Machine Learning Research, 125:1–21, 2020

work page 2020
[30]

Is Q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024

Gen Li, Changxiao Cai, Yuxin Chen, Yuting Wei, and Yuejie Chi. Is Q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024

work page 2024
[31]

Concentration inequalities and martingale inequalities: a survey

Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet mathematics, 3(1):79–127, 2006

work page 2006
[32]

A Variant of Azuma's Inequality for Martingales with Subgaussian Tails

Ohad Shamir. A variant of Azuma’s inequality for martingales with subgaussian tails.arXiv preprint arXiv:1110.2392, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011
[33]

Sharp concentration of the chromatic number on random graphs gn,p.Combinatorica, 7(1):121–129, Mar 1987

Eli Shamir and Joel Spencer. Sharp concentration of the chromatic number on random graphs gn,p.Combinatorica, 7(1):121–129, Mar 1987

work page 1987
[34]

Adapting to mixing time in stochastic optimization with Markovian data

Ron Dorfman and Kfir Yehuda Levy. Adapting to mixing time in stochastic optimization with Markovian data. InInternational Conference on Machine Learning, pages 5429–5446. PMLR, 2022

work page 2022
[35]

Least squares regression with Markovian data: Fundamental limits and algorithms.Advances in neural information processing systems, 33:16666–16676, 2020

Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with Markovian data: Fundamental limits and algorithms.Advances in neural information processing systems, 33:16666–16676, 2020

work page 2020
[36]

Stochastic linear bandits robust to adversarial attacks

Ilija Bogunovic, Arpan Losalka, Andreas Krause, and Jonathan Scarlett. Stochastic linear bandits robust to adversarial attacks. InInternational Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR, 2021

work page 2021
[37]

Corruption-tolerant gaussian process bandit optimization

Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant gaussian process bandit optimization. InInternational Conference on Artificial Intelligence and Statistics, pages 1071–1081. PMLR, 2020

work page 2020
[38]

Adversarial attacks on linear contextual bandits.arXiv preprint arXiv:2002.03839, 2020

Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits.arXiv preprint arXiv:2002.03839, 2020

work page arXiv 2002
[39]

Nearly optimal algorithms for linear contextual bandits with adversarial corruptions.arXiv preprint arXiv:2205.06811, 2022

Jiafan He, Dongruo Zhou, Tong Zhang, and Quanquan Gu. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions.arXiv preprint arXiv:2205.06811, 2022

work page arXiv 2022
[40]

Corruption-robust exploration in episodic reinforcement learning

Thodoris Lykouris, Max Simchowitz, Alex Slivkins, and Wen Sun. Corruption-robust exploration in episodic reinforcement learning. InConference on Learning Theory, pages 3242–3245. PMLR, 2021

work page 2021
[41]

Improved corruption robust algorithms for episodic reinforcement learning

Yifang Chen, Simon Du, and Kevin Jamieson. Improved corruption robust algorithms for episodic reinforcement learning. InInternational Conference on Machine Learning, pages 1561–1570. PMLR, 2021

work page 2021
[42]

A model selection approach for corruption robust reinforcement learning

Chen-Yu Wei, Christoph Dann, and Julian Zimmert. A model selection approach for corruption robust reinforcement learning. InInternational Conference on Algorithmic Learning Theory, pages 1043–1096. PMLR, 2022

work page 2022
[43]

Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and markov decision processes

Chenlu Ye, Wei Xiong, Quanquan Gu, and Tong Zhang. Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and markov decision processes. In International Conference on Machine Learning, pages 39834–39863. PMLR, 2023. 21

work page 2023
[44]

Corruption-robust offline reinforcement learning

Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 5757–5773. PMLR, 2022

work page 2022
[45]

Corruption-robust offline reinforcement learning with general function approximation.Advances in Neural Information Processing Systems, 36:36208–36221, 2023

Chenlu Ye, Rui Yang, Quanquan Gu, and Tong Zhang. Corruption-robust offline reinforcement learning with general function approximation.Advances in Neural Information Processing Systems, 36:36208–36221, 2023

work page 2023
[46]

Robust Q-learning under corrupted rewards

Sreejeet Maity and Aritra Mitra. Robust Q-learning under corrupted rewards. In2024 IEEE 63rd Conference on Decision and Control (CDC), pages 1181–1186. IEEE, 2024

work page 2024
[47]

Robust offline reinforcement learning with heavy-tailed rewards

Jin Zhu, Runzhe Wan, Zhengling Qi, Shikai Luo, and Chengchun Shi. Robust offline reinforcement learning with heavy-tailed rewards. InInternational Conference on Artificial Intelligence and Statistics, pages 541–549. PMLR, 2024

work page 2024
[48]

No-regret reinforcement learning with heavy-tailed rewards

Vincent Zhuang and Yanan Sui. No-regret reinforcement learning with heavy-tailed rewards. InInternational Conference on Artificial Intelligence and Statistics, pages 3385–3393. PMLR, 2021

work page 2021
[49]

Provably robust temporal difference learning for heavy-tailed rewards.Advances in Neural Information Processing Systems, 36, 2024

Semih Cayci and Atilla Eryilmaz. Provably robust temporal difference learning for heavy-tailed rewards.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[50]

A finite time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR, 2018

work page 2018
[51]

Adversarially-robust TD learning with Markovian data: Finite-time rates and fundamental limits.arXiv preprint arXiv:2502.04662, 2025

Sreejeet Maity and Aritra Mitra. Adversarially-robust TD learning with Markovian data: Finite-time rates and fundamental limits.arXiv preprint arXiv:2502.04662, 2025

work page arXiv 2025
[52]

Asynchronous stochastic approximation and Q-learning.Machine learning, 16:185–202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16:185–202, 1994

work page 1994
[53]

Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

work page 1993
[54]

MIT press, 2018

Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018

work page 2018
[55]

An analysis of temporal-difference learning with function approximation

John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. InIEEE Transactions on Automatic Control, 1997

work page 1997
[56]

Mean estimation and regression under heavy-tailed distributions: A survey.Found

Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey.Found. Comput. Math., 19(5):1145–1190, October 2019

work page 2019
[57]

Learning from untrusted data

Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pages 47–60, 2017

work page 2017
[58]

Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpℓ∞- bounds forQ-learning.arXiv preprint arXiv:1905.06265, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[59]

Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998

Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998. 22

work page 1998
[60]

Learning rates for Q-learning.Journal of machine learning Research, 5(1), 2003

Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for Q-learning.Journal of machine learning Research, 5(1), 2003

work page 2003
[61]

Near-optimal time and sample complexities for solving Markov decision processes with a generative model.Advances in Neural Information Processing Systems, 31, 2018

Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model.Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[62]

Robust Covariance and Scatter Matrix Estimation under Huber's Contamination Model

Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance matrix estimation via matrix depth.arXiv preprint arXiv:1506.00691, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[63]

Agnostic estimation of mean and covariance

Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016

work page 2016
[64]

High-dimensional robust mean estimation in nearly-linear time

Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. InProc. of the thirtieth annual ACM-SIAM symp. on discrete algorithms, pages 2755–2771. SIAM, 2019

work page 2019
[65]

Dalalyan and Arshak Minasyan

Arnak S. Dalalyan and Arshak Minasyan. All-in-one robust estimator of the gaussian mean. The Annals of Statistics, 2022

work page 2022
[66]

Performance of Q-learning with linear function approximation: Stability and finite-time analysis

Zaiwei Chen, Sheng Zhang, Thinh T Doan, Siva Theja Maguluri, and John-Paul Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, page 4, 2019

work page arXiv 1905
[67]

Finite-time error bounds for linear stochastic approximation and TD learning

Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. InConference on Learning Theory, pages 2803–2830. PMLR, 2019

work page 2019
[68]

Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

work page 2022
[69]

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

Gandharv Patil, LA Prashanth, Dheeraj Nagaraj, and Doina Precup. Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. InInternational Conference on Artificial Intelligence and Statistics, pages 5438–5448. PMLR, 2023

work page 2023
[70]

A simple finite-time analysis of TD learning with linear function approximation

Aritra Mitra. A simple finite-time analysis of TD learning with linear function approximation. IEEE Transactions on Automatic Control, 70(2):1388–1394, 2024

work page 2024
[71]

Estimating the mixing time of ergodic markov chains

Geoffrey Wolfer and Aryeh Kontorovich. Estimating the mixing time of ergodic markov chains. InConference on Learning Theory, pages 3120–3159. PMLR, 2019

work page 2019
[72]

Mixing time estimation in reversible markov chains from a single sample path.Advances in neural information processing systems, 28, 2015

Daniel J Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible markov chains from a single sample path.Advances in neural information processing systems, 28, 2015

work page 2015
[73]

Robust estimation of discrete distributions under local differential privacy

Julien Chhor and Flore Sentenac. Robust estimation of discrete distributions under local differential privacy. InInternational Conference on Algorithmic Learning Theory, pages 411–446. PMLR, 2023

work page 2023
[74]

Robust online and distributed mean estimation under adversarial data corruption

Tong Yao and Shreyas Sundaram. Robust online and distributed mean estimation under adversarial data corruption. In2022 IEEE 61st Conference on Decision and Control (CDC), pages 4193–4198. IEEE, 2022. 23

work page 2022
[75]

Outlier-robust linear system identification under heavy-tailed noise.arXiv preprint arXiv:2501.00421, 2024

Vinay Kanakeri and Aritra Mitra. Outlier-robust linear system identification under heavy-tailed noise.arXiv preprint arXiv:2501.00421, 2024

work page arXiv 2024
[76]

Boosting-enabled robust system identification of partially observed lti systems under heavy-tailed noise.arXiv preprint arXiv:2504.18444, 2025

Vinay Kanakeri and Aritra Mitra. Boosting-enabled robust system identification of partially observed lti systems under heavy-tailed noise.arXiv preprint arXiv:2504.18444, 2025

work page arXiv 2025
[77]

Springer, 2009

Vivek S Borkar.Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009

work page 2009
[78]

The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

work page 2000
[79]

Q-learning with nearest neighbors.Advances in Neural Information Processing Systems, 31, 2018

Devavrat Shah and Qiaomin Xie. Q-learning with nearest neighbors.Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[80]

Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling

LA Prashanth, Nathaniel Korda, and Rémi Munos. Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling. Machine Learning, 110(3):559–618, 2021

work page 2021

Showing first 80 references.

[1] [1]

A review of reinforcement learning in financial applications.Annual Review of Statistics and Its Application, 12(1):209–232, 2025

Yahui Bai, Yuhe Gao, Runzhe Wan, Sheng Zhang, and Rui Song. A review of reinforcement learning in financial applications.Annual Review of Statistics and Its Application, 12(1):209–232, 2025

work page 2025

[2] [2]

Reinforcement learning in personalized medicine: A comprehensive review of treatment optimization strategies.Cureus, 17(4), 2025

K Banumathi, Latha Venkatesan, Lizy Sonia Benjamin, K Vijayalakshmi, Nesa Sathya Satchi, and Nesa Sathya Satchi IV. Reinforcement learning in personalized medicine: A comprehensive review of treatment optimization strategies.Cureus, 17(4), 2025

work page 2025

[3] [3]

Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022

M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022

work page 2022

[4] [4]

Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving

Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving.arXiv preprint arXiv:1610.03295, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[5] [5]

Reinforcement learning in robotics: A survey

Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013

work page 2013

[6] [6]

End-to-end training of deep visuomotor policies.The Journal of Machine Learning Research, 17(1):1334–1373, 2016

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.The Journal of Machine Learning Research, 17(1):1334–1373, 2016

work page 2016

[7] [7]

Exploration-driven policy optimization in rlhf: Theoretical insights on efficient data utilization.arXiv preprint arXiv:2402.10342, 2024

Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, and R Srikant. Exploration-driven policy optimization in rlhf: Theoretical insights on efficient data utilization.arXiv preprint arXiv:2402.10342, 2024

work page arXiv 2024

[8] [8]

Corruption robust offline reinforcement learning with human feedback.arXiv preprint arXiv:2402.06734, 2024

Debmalya Mandal, Andi Nika, Parameswaran Kamalaruban, Adish Singla, and Goran Radanović. Corruption robust offline reinforcement learning with human feedback.arXiv preprint arXiv:2402.06734, 2024

work page arXiv 2024

[9] [9]

Provably robust dpo: Aligning language models with noisy feed- back,

Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natarajan. Provably robust dpo: Aligning language models with noisy feedback.arXiv preprint arXiv:2403.00409, 2024

work page arXiv 2024

[10] [10]

Learning product rankings robust to fake users

Negin Golrezaei, Vahideh Manshadi, Jon Schneider, and Shreyas Sekar. Learning product rankings robust to fake users. InProceedings of the 22nd ACM Conference on Economics and Computation, pages 560–561, 2021

work page 2021

[11] [11]

Identifying and treating outliers in finance.Financial Management, 48(2):345–384, 2019

John Adams, Darren Hayunga, Sattar Mansi, David Reeb, and Vincenzo Verardi. Identifying and treating outliers in finance.Financial Management, 48(2):345–384, 2019. 19

work page 2019

[12] [12]

A systems and control perspective of CPS security.Annual reviews in control, 47:394–411, 2019

Seyed Mehran Dibaji, Mohammad Pirani, David Bezalel Flamholz, Anuradha M Annaswamy, Karl Henrik Johansson, and Aranya Chakrabortty. A systems and control perspective of CPS security.Annual reviews in control, 47:394–411, 2019

work page 2019

[13] [13]

Q-learning.Machine learning, 8:279–292, 1992

Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8:279–292, 1992

work page 1992

[14] [14]

Robust estimation of a location parameter

Peter J Huber. Robust estimation of a location parameter. InBreakthroughs in statistics, pages 492–518. Springer, 1992

work page 1992

[15] [15]

John Wiley & Sons, 2004

Peter J Huber.Robust statistics, volume 523. John Wiley & Sons, 2004

work page 2004

[16] [16]

Adversarial Attacks on Stochastic Bandits

Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Xiaojin Zhu. Adversarial attacks on stochastic bandits.arXiv preprint arXiv:1810.12188, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[17] [17]

Stochastic bandits robust to adversarial corruptions

Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. InProceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122, 2018

work page 2018

[18] [18]

Data poisoning attacks on stochastic bandits

Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. InInternational Conference on Machine Learning, pages 4042–4050. PMLR, 2019

work page 2019

[19] [19]

Better algorithms for stochastic bandits with adversarial corruptions

Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. InConference on Learning Theory, pages 1562–1578. PMLR, 2019

work page 2019

[20] [20]

Corruption-tolerant bandit learning.Machine Learning, 108(4):687–715, 2019

Sayash Kapoor, Kumar Kshitij Patel, and Purushottam Kar. Corruption-tolerant bandit learning.Machine Learning, 108(4):687–715, 2019

work page 2019

[21] [21]

Crimed: Lower and upper bounds on regret for bandits with unbounded stochastic corruption

Shubhada Agrawal, Timothée Mathieu, Debabrota Basu, and Odalric-Ambrym Maillard. Crimed: Lower and upper bounds on regret for bandits with unbounded stochastic corruption. In International Conference on Algorithmic Learning Theory, pages 74–124. PMLR, 2024

work page 2024

[22] [22]

Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

work page 1988

[23] [23]

Robust multivariate mean estimation: the optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021

Gabor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021

work page 2021

[24] [24]

On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence

Nathaniel Korda and Prashanth La. On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. InInternational conference on machine learning, pages 626–634. PMLR, 2015

work page 2015

[25] [25]

Finite sample analyses for TD (0) with function approximation

Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for TD (0) with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018

[26] [26]

state-of-the-art

C Narayanan and Csaba Szepesvári. Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results. Technical report, Technical report, 2017

work page 2017

[27] [27]

Linear Stochastic Approximation: Constant Step-Size and Iterate Averaging

Chandrashekar Lakshminarayanan and Csaba Szepesvári. Linear stochastic approximation: Constant step-size and iterate averaging.arXiv preprint arXiv:1709.04073, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [28]

Cambridge university press, 2019

Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. 20

work page 2019

[29] [29]

Finite-time analysis of asynchronous stochastic approximation and Q-learning.Proceedings of Machine Learning Research, 125:1–21, 2020

Adam Wierman Guannan Qu. Finite-time analysis of asynchronous stochastic approximation and Q-learning.Proceedings of Machine Learning Research, 125:1–21, 2020

work page 2020

[30] [30]

Is Q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024

Gen Li, Changxiao Cai, Yuxin Chen, Yuting Wei, and Yuejie Chi. Is Q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024

work page 2024

[31] [31]

Concentration inequalities and martingale inequalities: a survey

Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet mathematics, 3(1):79–127, 2006

work page 2006

[32] [32]

A Variant of Azuma's Inequality for Martingales with Subgaussian Tails

Ohad Shamir. A variant of Azuma’s inequality for martingales with subgaussian tails.arXiv preprint arXiv:1110.2392, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011

[33] [33]

Sharp concentration of the chromatic number on random graphs gn,p.Combinatorica, 7(1):121–129, Mar 1987

Eli Shamir and Joel Spencer. Sharp concentration of the chromatic number on random graphs gn,p.Combinatorica, 7(1):121–129, Mar 1987

work page 1987

[34] [34]

Adapting to mixing time in stochastic optimization with Markovian data

Ron Dorfman and Kfir Yehuda Levy. Adapting to mixing time in stochastic optimization with Markovian data. InInternational Conference on Machine Learning, pages 5429–5446. PMLR, 2022

work page 2022

[35] [35]

Least squares regression with Markovian data: Fundamental limits and algorithms.Advances in neural information processing systems, 33:16666–16676, 2020

Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with Markovian data: Fundamental limits and algorithms.Advances in neural information processing systems, 33:16666–16676, 2020

work page 2020

[36] [36]

Stochastic linear bandits robust to adversarial attacks

Ilija Bogunovic, Arpan Losalka, Andreas Krause, and Jonathan Scarlett. Stochastic linear bandits robust to adversarial attacks. InInternational Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR, 2021

work page 2021

[37] [37]

Corruption-tolerant gaussian process bandit optimization

Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant gaussian process bandit optimization. InInternational Conference on Artificial Intelligence and Statistics, pages 1071–1081. PMLR, 2020

work page 2020

[38] [38]

Adversarial attacks on linear contextual bandits.arXiv preprint arXiv:2002.03839, 2020

Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits.arXiv preprint arXiv:2002.03839, 2020

work page arXiv 2002

[39] [39]

Nearly optimal algorithms for linear contextual bandits with adversarial corruptions.arXiv preprint arXiv:2205.06811, 2022

Jiafan He, Dongruo Zhou, Tong Zhang, and Quanquan Gu. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions.arXiv preprint arXiv:2205.06811, 2022

work page arXiv 2022

[40] [40]

Corruption-robust exploration in episodic reinforcement learning

Thodoris Lykouris, Max Simchowitz, Alex Slivkins, and Wen Sun. Corruption-robust exploration in episodic reinforcement learning. InConference on Learning Theory, pages 3242–3245. PMLR, 2021

work page 2021

[41] [41]

Improved corruption robust algorithms for episodic reinforcement learning

Yifang Chen, Simon Du, and Kevin Jamieson. Improved corruption robust algorithms for episodic reinforcement learning. InInternational Conference on Machine Learning, pages 1561–1570. PMLR, 2021

work page 2021

[42] [42]

A model selection approach for corruption robust reinforcement learning

Chen-Yu Wei, Christoph Dann, and Julian Zimmert. A model selection approach for corruption robust reinforcement learning. InInternational Conference on Algorithmic Learning Theory, pages 1043–1096. PMLR, 2022

work page 2022

[43] [43]

Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and markov decision processes

Chenlu Ye, Wei Xiong, Quanquan Gu, and Tong Zhang. Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and markov decision processes. In International Conference on Machine Learning, pages 39834–39863. PMLR, 2023. 21

work page 2023

[44] [44]

Corruption-robust offline reinforcement learning

Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 5757–5773. PMLR, 2022

work page 2022

[45] [45]

Corruption-robust offline reinforcement learning with general function approximation.Advances in Neural Information Processing Systems, 36:36208–36221, 2023

Chenlu Ye, Rui Yang, Quanquan Gu, and Tong Zhang. Corruption-robust offline reinforcement learning with general function approximation.Advances in Neural Information Processing Systems, 36:36208–36221, 2023

work page 2023

[46] [46]

Robust Q-learning under corrupted rewards

Sreejeet Maity and Aritra Mitra. Robust Q-learning under corrupted rewards. In2024 IEEE 63rd Conference on Decision and Control (CDC), pages 1181–1186. IEEE, 2024

work page 2024

[47] [47]

Robust offline reinforcement learning with heavy-tailed rewards

Jin Zhu, Runzhe Wan, Zhengling Qi, Shikai Luo, and Chengchun Shi. Robust offline reinforcement learning with heavy-tailed rewards. InInternational Conference on Artificial Intelligence and Statistics, pages 541–549. PMLR, 2024

work page 2024

[48] [48]

No-regret reinforcement learning with heavy-tailed rewards

Vincent Zhuang and Yanan Sui. No-regret reinforcement learning with heavy-tailed rewards. InInternational Conference on Artificial Intelligence and Statistics, pages 3385–3393. PMLR, 2021

work page 2021

[49] [49]

Provably robust temporal difference learning for heavy-tailed rewards.Advances in Neural Information Processing Systems, 36, 2024

Semih Cayci and Atilla Eryilmaz. Provably robust temporal difference learning for heavy-tailed rewards.Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[50] [50]

A finite time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR, 2018

work page 2018

[51] [51]

Adversarially-robust TD learning with Markovian data: Finite-time rates and fundamental limits.arXiv preprint arXiv:2502.04662, 2025

Sreejeet Maity and Aritra Mitra. Adversarially-robust TD learning with Markovian data: Finite-time rates and fundamental limits.arXiv preprint arXiv:2502.04662, 2025

work page arXiv 2025

[52] [52]

Asynchronous stochastic approximation and Q-learning.Machine learning, 16:185–202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16:185–202, 1994

work page 1994

[53] [53]

Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

work page 1993

[54] [54]

MIT press, 2018

Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018

work page 2018

[55] [55]

An analysis of temporal-difference learning with function approximation

John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. InIEEE Transactions on Automatic Control, 1997

work page 1997

[56] [56]

Mean estimation and regression under heavy-tailed distributions: A survey.Found

Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey.Found. Comput. Math., 19(5):1145–1190, October 2019

work page 2019

[57] [57]

Learning from untrusted data

Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pages 47–60, 2017

work page 2017

[58] [58]

Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpℓ∞- bounds forQ-learning.arXiv preprint arXiv:1905.06265, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[59] [59]

Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998

Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998. 22

work page 1998

[60] [60]

Learning rates for Q-learning.Journal of machine learning Research, 5(1), 2003

Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for Q-learning.Journal of machine learning Research, 5(1), 2003

work page 2003

[61] [61]

Near-optimal time and sample complexities for solving Markov decision processes with a generative model.Advances in Neural Information Processing Systems, 31, 2018

Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model.Advances in Neural Information Processing Systems, 31, 2018

work page 2018

[62] [62]

Robust Covariance and Scatter Matrix Estimation under Huber's Contamination Model

Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance matrix estimation via matrix depth.arXiv preprint arXiv:1506.00691, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[63] [63]

Agnostic estimation of mean and covariance

Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016

work page 2016

[64] [64]

High-dimensional robust mean estimation in nearly-linear time

Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. InProc. of the thirtieth annual ACM-SIAM symp. on discrete algorithms, pages 2755–2771. SIAM, 2019

work page 2019

[65] [65]

Dalalyan and Arshak Minasyan

Arnak S. Dalalyan and Arshak Minasyan. All-in-one robust estimator of the gaussian mean. The Annals of Statistics, 2022

work page 2022

[66] [66]

Performance of Q-learning with linear function approximation: Stability and finite-time analysis

Zaiwei Chen, Sheng Zhang, Thinh T Doan, Siva Theja Maguluri, and John-Paul Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, page 4, 2019

work page arXiv 1905

[67] [67]

Finite-time error bounds for linear stochastic approximation and TD learning

Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. InConference on Learning Theory, pages 2803–2830. PMLR, 2019

work page 2019

[68] [68]

Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

work page 2022

[69] [69]

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

Gandharv Patil, LA Prashanth, Dheeraj Nagaraj, and Doina Precup. Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. InInternational Conference on Artificial Intelligence and Statistics, pages 5438–5448. PMLR, 2023

work page 2023

[70] [70]

A simple finite-time analysis of TD learning with linear function approximation

Aritra Mitra. A simple finite-time analysis of TD learning with linear function approximation. IEEE Transactions on Automatic Control, 70(2):1388–1394, 2024

work page 2024

[71] [71]

Estimating the mixing time of ergodic markov chains

Geoffrey Wolfer and Aryeh Kontorovich. Estimating the mixing time of ergodic markov chains. InConference on Learning Theory, pages 3120–3159. PMLR, 2019

work page 2019

[72] [72]

Mixing time estimation in reversible markov chains from a single sample path.Advances in neural information processing systems, 28, 2015

Daniel J Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible markov chains from a single sample path.Advances in neural information processing systems, 28, 2015

work page 2015

[73] [73]

Robust estimation of discrete distributions under local differential privacy

Julien Chhor and Flore Sentenac. Robust estimation of discrete distributions under local differential privacy. InInternational Conference on Algorithmic Learning Theory, pages 411–446. PMLR, 2023

work page 2023

[74] [74]

Robust online and distributed mean estimation under adversarial data corruption

Tong Yao and Shreyas Sundaram. Robust online and distributed mean estimation under adversarial data corruption. In2022 IEEE 61st Conference on Decision and Control (CDC), pages 4193–4198. IEEE, 2022. 23

work page 2022

[75] [75]

Outlier-robust linear system identification under heavy-tailed noise.arXiv preprint arXiv:2501.00421, 2024

Vinay Kanakeri and Aritra Mitra. Outlier-robust linear system identification under heavy-tailed noise.arXiv preprint arXiv:2501.00421, 2024

work page arXiv 2024

[76] [76]

Boosting-enabled robust system identification of partially observed lti systems under heavy-tailed noise.arXiv preprint arXiv:2504.18444, 2025

Vinay Kanakeri and Aritra Mitra. Boosting-enabled robust system identification of partially observed lti systems under heavy-tailed noise.arXiv preprint arXiv:2504.18444, 2025

work page arXiv 2025

[77] [77]

Springer, 2009

Vivek S Borkar.Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009

work page 2009

[78] [78]

The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

work page 2000

[79] [79]

Q-learning with nearest neighbors.Advances in Neural Information Processing Systems, 31, 2018

Devavrat Shah and Qiaomin Xie. Q-learning with nearest neighbors.Advances in Neural Information Processing Systems, 31, 2018

work page 2018

[80] [80]

Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling

LA Prashanth, Nathaniel Korda, and Rémi Munos. Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling. Machine Learning, 110(3):559–618, 2021

work page 2021