pith. sign in

arxiv: 2509.08933 · v2 · pith:JYNFBQ5Wnew · submitted 2025-09-10 · 💻 cs.LG · cs.SY· eess.SY· math.OC

Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates

Pith reviewed 2026-05-22 12:55 UTC · model grok-4.3

classification 💻 cs.LG cs.SYeess.SYmath.OC
keywords Q-learningreinforcement learningadversarial corruptionasynchronous samplingfinite-time analysisrobust RLalmost-martingales
0
0 comments X

The pith

Despite adversarial reward corruption, a robust asynchronous Q-learning algorithm achieves finite-time convergence rates matching standard bounds up to an additive term that scales with the corruption fraction, and these rates are shown to

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a robust variant of Q-learning designed to learn optimal policies in discounted infinite-horizon reinforcement learning even when rewards are corrupted by an adversary. Under the asynchronous sampling model with time-correlated data, it establishes that the finite-time error bounds match those of the uncorrupted setting except for an extra term linear in the fraction of corrupted samples. The method requires no knowledge of the reward distribution and supplies the first such finite-time robustness results for asynchronous Q-learning. The proof relies on a refined concentration inequality that handles almost-martingales induced by the sampling process.

Core claim

A novel robust Q-learning algorithm attains finite-time convergence to the optimal action-value function at rates that remain near-optimal under adversarial reward corruptions, matching existing asynchronous Q-learning bounds up to an additive term proportional to the corruption fraction; an information-theoretic lower bound confirms that no algorithm can do substantially better, and the guarantees hold without assumptions on the reward distribution.

What carries the argument

A refined Azuma-Hoeffding inequality for almost-martingales that controls concentration of the time-correlated, partially corrupted samples arising in asynchronous Q-learning.

If this is right

  • The algorithm remains agnostic to the reward distribution while still delivering the stated guarantees.
  • The same finite-time robustness extends to the asynchronous sampling model with temporal correlations.
  • Matching lower bounds establish that the derived rates cannot be improved by more than constants.
  • The refined concentration tool may be reusable for other RL algorithms that face similar sampling and corruption issues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar robustification techniques could be applied to other value-based methods such as SARSA or fitted Q-iteration.
  • In deployed systems the results suggest that modest corruption fractions incur only linear rather than catastrophic degradation.
  • The approach may inform the design of RL algorithms that must operate with noisy or manipulated feedback from human users or sensors.

Load-bearing premise

A fixed but unknown fraction of the observed rewards may be arbitrarily corrupted by an adversary, and the almost-martingale structure induced by asynchronous sampling must satisfy the refined concentration inequality used in the analysis.

What would settle it

An experiment that measures the sample complexity needed to reach a given accuracy level while corrupting a known fraction epsilon of rewards and checks whether the observed complexity exceeds the standard bound by more than a constant times epsilon.

Figures

Figures reproduced from arXiv: 2509.08933 by Aritra Mitra, Sreejeet Maity.

Figure 1
Figure 1. Figure 1: (Top Left) Plots of the ℓ∞ error Et = ∥Qt−Q∗∥∞ for Vanilla-Q without corruption, under varying reward variances σ 2 ∈ {1, 5, 15}. (Top Right) Plots of the ℓ∞ error Et for Vanilla-Q under the Huber-contaminated reward model in Eq. (3) with corruption probability ε = {0.001, 0.005, 0.01}, variance σ 2 = 1, and a biasing attack where the attack signal is −104 . (Bottom Left and Bottom Right): Analogous plots … view at source ↗
Figure 2
Figure 2. Figure 2: (Top Left) Plots of the ℓ∞ error Et = ∥Qt − Q∗∥∞ for Robust Async-Q, with corruption fractions ε ∈ {0.001, 0.005, 0.01} and reward variance σ 2 = 1, under the biasing attack of magnitude −104 . (Top Right) Plots of the ℓ∞ error for Robust Async-Q, with corruption fractions ε ∈ {0.001, 0.005, 0.01} and reward variance σ 2 = 5, under a biasing attack of −104 . (Bottom Left and Bottom Right) Analogous plots f… view at source ↗
Figure 3
Figure 3. Figure 3: (Left) Plots of the ℓ∞ error Et = ∥Qt − Q∗∥∞ for Robust Async-RAQ, with corruption fraction ε = 0.001 and reward variance σ 2 = 1, under the biasing attack of magnitude −104 . (Right) Plots of the ℓ∞ error for Robust Async-RAQ, with corruption fraction ε = 0.001 and reward variance σ 2 = 5, under the same biasing attack. The reported plots are averaged over 100 independent runs. Discussion on Experiment 3:… view at source ↗
Figure 4
Figure 4. Figure 4: Plots of the ℓ∞ error Et = ∥Qt − Q∗∥∞ for Robust Async-RAQ-M, with corruption fraction ε ∈ {0.001, 0.005, 0.01} and reward variance σ 2 = 1, under the biasing attack of magnitude −104 . The reported plots are averaged over 50 independent runs. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_4.png] view at source ↗
read the original abstract

We study the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the \(Q\)-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous \(Q\)-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper develops a robust variant of asynchronous Q-learning for discounted infinite-horizon MDPs subject to adversarial reward corruptions affecting a bounded fraction of samples. It claims finite-time convergence guarantees that recover the best known rates for the uncorrupted asynchronous case up to an additive term linear in the corruption fraction, establishes a matching information-theoretic lower bound, and introduces a refined Azuma-Hoeffding inequality for almost-martingales as the key technical tool. The algorithm is distribution-agnostic and provides the first such finite-time robustness results under asynchronous sampling.

Significance. If the central bounds hold, the work supplies the first finite-time robustness guarantees for asynchronous Q-learning, a setting that is practically important because real-world data collection is often asynchronous and time-correlated. The information-theoretic lower bound and the distribution-agnostic property strengthen the contribution. The refined concentration inequality for almost-martingales is presented as potentially reusable and is therefore a positive technical feature.

major comments (1)
  1. [Section 4] Section 4 (main analysis) and the statement of the refined Azuma-Hoeffding inequality: the proof must explicitly verify that the almost-martingale property and bounded-difference conditions continue to hold when the asynchronous sampling process (which already induces temporal correlations) is combined with adversarial corruptions on a fraction of the observed rewards. The additive corruption term in the sample-complexity bound rests on this joint applicability; without a self-contained argument showing that corruptions do not introduce uncontrolled deviations outside the almost-martingale framework, the claimed near-optimality is not yet justified.
minor comments (2)
  1. [Abstract] The abstract and introduction could state the precise dependence of the additive term on the corruption fraction (e.g., whether it is linear in the fraction or involves additional logarithmic factors) to make the near-optimality claim immediately quantifiable.
  2. [Section 2] Notation for the corruption model (e.g., the indicator for corrupted samples and the bound on the corruption magnitude) should be introduced once in a dedicated preliminary subsection rather than inline in the analysis.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The feedback on the technical details of the analysis in Section 4 is appreciated, and we address the major comment below.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (main analysis) and the statement of the refined Azuma-Hoeffding inequality: the proof must explicitly verify that the almost-martingale property and bounded-difference conditions continue to hold when the asynchronous sampling process (which already induces temporal correlations) is combined with adversarial corruptions on a fraction of the observed rewards. The additive corruption term in the sample-complexity bound rests on this joint applicability; without a self-contained argument showing that corruptions do not introduce uncontrolled deviations outside the almost-martingale framework, the claimed near-optimality is not yet justified.

    Authors: We thank the referee for pointing out the need for greater explicitness on this technical point. In the current manuscript, the refined Azuma-Hoeffding inequality is applied to the almost-martingale difference sequence generated by the stochastic approximation errors under asynchronous sampling; the adversarial corruptions are isolated as a separate additive perturbation whose total contribution is bounded by the corruption fraction times the reward range. Because the corruptions affect only a bounded fraction of samples and do not alter the conditional-expectation structure of the clean samples with respect to the filtration induced by the asynchronous process, the almost-martingale and bounded-difference properties are preserved for the noise term. Nevertheless, we agree that a self-contained verification of this joint applicability would strengthen the presentation. In the revised version we will insert a new supporting lemma immediately preceding the main convergence argument in Section 4. The lemma will (i) recall the definition of the almost-martingale under asynchronous sampling, (ii) show that the corruption indicator process contributes only a deterministic bias that is absorbed into the additive corruption term of the final bound, and (iii) verify that the bounded-difference condition continues to hold uniformly over the combined process. This addition will make the justification of the near-optimal rate fully explicit without changing any of the stated theorems. revision: yes

Circularity Check

0 steps flagged

Derivation relies on external concentration inequality and independent lower bound; no reduction to fitted parameters or self-referential definitions.

full rationale

The paper introduces a robust Q-learning variant and derives finite-time bounds by applying a refined Azuma-Hoeffding inequality to almost-martingales under asynchronous sampling and limited adversarial corruption. It separately establishes an information-theoretic lower bound to show near-optimality. No step equates a claimed prediction or rate to a quantity defined by the algorithm's own fitted parameters, nor does any central claim reduce to a self-citation chain that itself assumes the target result. The analysis is presented as self-contained against standard RL benchmarks, with the inequality invoked as a technical tool rather than derived from the paper's outputs. This yields a normal non-finding of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is based only on the abstract, so the ledger reflects standard assumptions inferred from the described RL setting and corruption model; full paper would be needed to list all invoked lemmas or background results.

axioms (2)
  • domain assumption Discounted infinite-horizon MDP with asynchronous sampling and time-correlated observations
    The problem setting stated in the abstract.
  • domain assumption Adversarial reward corruption affecting at most a known fraction of samples
    The corruption model under which the robustness guarantees are claimed.

pith-pipeline@v0.9.0 · 5686 in / 1354 out tokens · 40763 ms · 2026-05-22T12:55:05.375656+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 7 internal anchors

  1. [1]

    A review of reinforcement learning in financial applications.Annual Review of Statistics and Its Application, 12(1):209–232, 2025

    Yahui Bai, Yuhe Gao, Runzhe Wan, Sheng Zhang, and Rui Song. A review of reinforcement learning in financial applications.Annual Review of Statistics and Its Application, 12(1):209–232, 2025

  2. [2]

    Reinforcement learning in personalized medicine: A comprehensive review of treatment optimization strategies.Cureus, 17(4), 2025

    K Banumathi, Latha Venkatesan, Lizy Sonia Benjamin, K Vijayalakshmi, Nesa Sathya Satchi, and Nesa Sathya Satchi IV. Reinforcement learning in personalized medicine: A comprehensive review of treatment optimization strategies.Cureus, 17(4), 2025

  3. [3]

    Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022

    M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022

  4. [4]

    Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving

    Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving.arXiv preprint arXiv:1610.03295, 2016

  5. [5]

    Reinforcement learning in robotics: A survey

    Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013

  6. [6]

    End-to-end training of deep visuomotor policies.The Journal of Machine Learning Research, 17(1):1334–1373, 2016

    Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.The Journal of Machine Learning Research, 17(1):1334–1373, 2016

  7. [7]

    Exploration-driven policy optimization in rlhf: Theoretical insights on efficient data utilization.arXiv preprint arXiv:2402.10342, 2024

    Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, and R Srikant. Exploration-driven policy optimization in rlhf: Theoretical insights on efficient data utilization.arXiv preprint arXiv:2402.10342, 2024

  8. [8]

    Corruption robust offline reinforcement learning with human feedback.arXiv preprint arXiv:2402.06734, 2024

    Debmalya Mandal, Andi Nika, Parameswaran Kamalaruban, Adish Singla, and Goran Radanović. Corruption robust offline reinforcement learning with human feedback.arXiv preprint arXiv:2402.06734, 2024

  9. [9]

    Provably robust dpo: Aligning language models with noisy feed- back,

    Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natarajan. Provably robust dpo: Aligning language models with noisy feedback.arXiv preprint arXiv:2403.00409, 2024

  10. [10]

    Learning product rankings robust to fake users

    Negin Golrezaei, Vahideh Manshadi, Jon Schneider, and Shreyas Sekar. Learning product rankings robust to fake users. InProceedings of the 22nd ACM Conference on Economics and Computation, pages 560–561, 2021

  11. [11]

    Identifying and treating outliers in finance.Financial Management, 48(2):345–384, 2019

    John Adams, Darren Hayunga, Sattar Mansi, David Reeb, and Vincenzo Verardi. Identifying and treating outliers in finance.Financial Management, 48(2):345–384, 2019. 19

  12. [12]

    A systems and control perspective of CPS security.Annual reviews in control, 47:394–411, 2019

    Seyed Mehran Dibaji, Mohammad Pirani, David Bezalel Flamholz, Anuradha M Annaswamy, Karl Henrik Johansson, and Aranya Chakrabortty. A systems and control perspective of CPS security.Annual reviews in control, 47:394–411, 2019

  13. [13]

    Q-learning.Machine learning, 8:279–292, 1992

    Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8:279–292, 1992

  14. [14]

    Robust estimation of a location parameter

    Peter J Huber. Robust estimation of a location parameter. InBreakthroughs in statistics, pages 492–518. Springer, 1992

  15. [15]

    John Wiley & Sons, 2004

    Peter J Huber.Robust statistics, volume 523. John Wiley & Sons, 2004

  16. [16]

    Adversarial Attacks on Stochastic Bandits

    Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Xiaojin Zhu. Adversarial attacks on stochastic bandits.arXiv preprint arXiv:1810.12188, 2018

  17. [17]

    Stochastic bandits robust to adversarial corruptions

    Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. InProceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122, 2018

  18. [18]

    Data poisoning attacks on stochastic bandits

    Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. InInternational Conference on Machine Learning, pages 4042–4050. PMLR, 2019

  19. [19]

    Better algorithms for stochastic bandits with adversarial corruptions

    Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. InConference on Learning Theory, pages 1562–1578. PMLR, 2019

  20. [20]

    Corruption-tolerant bandit learning.Machine Learning, 108(4):687–715, 2019

    Sayash Kapoor, Kumar Kshitij Patel, and Purushottam Kar. Corruption-tolerant bandit learning.Machine Learning, 108(4):687–715, 2019

  21. [21]

    Crimed: Lower and upper bounds on regret for bandits with unbounded stochastic corruption

    Shubhada Agrawal, Timothée Mathieu, Debabrota Basu, and Odalric-Ambrym Maillard. Crimed: Lower and upper bounds on regret for bandits with unbounded stochastic corruption. In International Conference on Algorithmic Learning Theory, pages 74–124. PMLR, 2024

  22. [22]

    Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

    Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

  23. [23]

    Robust multivariate mean estimation: the optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021

    Gabor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021

  24. [24]

    On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence

    Nathaniel Korda and Prashanth La. On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. InInternational conference on machine learning, pages 626–634. PMLR, 2015

  25. [25]

    Finite sample analyses for TD (0) with function approximation

    Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for TD (0) with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  26. [26]

    state-of-the-art

    C Narayanan and Csaba Szepesvári. Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results. Technical report, Technical report, 2017

  27. [27]

    Linear Stochastic Approximation: Constant Step-Size and Iterate Averaging

    Chandrashekar Lakshminarayanan and Csaba Szepesvári. Linear stochastic approximation: Constant step-size and iterate averaging.arXiv preprint arXiv:1709.04073, 2017

  28. [28]

    Cambridge university press, 2019

    Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. 20

  29. [29]

    Finite-time analysis of asynchronous stochastic approximation and Q-learning.Proceedings of Machine Learning Research, 125:1–21, 2020

    Adam Wierman Guannan Qu. Finite-time analysis of asynchronous stochastic approximation and Q-learning.Proceedings of Machine Learning Research, 125:1–21, 2020

  30. [30]

    Is Q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024

    Gen Li, Changxiao Cai, Yuxin Chen, Yuting Wei, and Yuejie Chi. Is Q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024

  31. [31]

    Concentration inequalities and martingale inequalities: a survey

    Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet mathematics, 3(1):79–127, 2006

  32. [32]

    A Variant of Azuma's Inequality for Martingales with Subgaussian Tails

    Ohad Shamir. A variant of Azuma’s inequality for martingales with subgaussian tails.arXiv preprint arXiv:1110.2392, 2011

  33. [33]

    Sharp concentration of the chromatic number on random graphs gn,p.Combinatorica, 7(1):121–129, Mar 1987

    Eli Shamir and Joel Spencer. Sharp concentration of the chromatic number on random graphs gn,p.Combinatorica, 7(1):121–129, Mar 1987

  34. [34]

    Adapting to mixing time in stochastic optimization with Markovian data

    Ron Dorfman and Kfir Yehuda Levy. Adapting to mixing time in stochastic optimization with Markovian data. InInternational Conference on Machine Learning, pages 5429–5446. PMLR, 2022

  35. [35]

    Least squares regression with Markovian data: Fundamental limits and algorithms.Advances in neural information processing systems, 33:16666–16676, 2020

    Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with Markovian data: Fundamental limits and algorithms.Advances in neural information processing systems, 33:16666–16676, 2020

  36. [36]

    Stochastic linear bandits robust to adversarial attacks

    Ilija Bogunovic, Arpan Losalka, Andreas Krause, and Jonathan Scarlett. Stochastic linear bandits robust to adversarial attacks. InInternational Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR, 2021

  37. [37]

    Corruption-tolerant gaussian process bandit optimization

    Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant gaussian process bandit optimization. InInternational Conference on Artificial Intelligence and Statistics, pages 1071–1081. PMLR, 2020

  38. [38]

    Adversarial attacks on linear contextual bandits.arXiv preprint arXiv:2002.03839, 2020

    Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits.arXiv preprint arXiv:2002.03839, 2020

  39. [39]

    Nearly optimal algorithms for linear contextual bandits with adversarial corruptions.arXiv preprint arXiv:2205.06811, 2022

    Jiafan He, Dongruo Zhou, Tong Zhang, and Quanquan Gu. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions.arXiv preprint arXiv:2205.06811, 2022

  40. [40]

    Corruption-robust exploration in episodic reinforcement learning

    Thodoris Lykouris, Max Simchowitz, Alex Slivkins, and Wen Sun. Corruption-robust exploration in episodic reinforcement learning. InConference on Learning Theory, pages 3242–3245. PMLR, 2021

  41. [41]

    Improved corruption robust algorithms for episodic reinforcement learning

    Yifang Chen, Simon Du, and Kevin Jamieson. Improved corruption robust algorithms for episodic reinforcement learning. InInternational Conference on Machine Learning, pages 1561–1570. PMLR, 2021

  42. [42]

    A model selection approach for corruption robust reinforcement learning

    Chen-Yu Wei, Christoph Dann, and Julian Zimmert. A model selection approach for corruption robust reinforcement learning. InInternational Conference on Algorithmic Learning Theory, pages 1043–1096. PMLR, 2022

  43. [43]

    Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and markov decision processes

    Chenlu Ye, Wei Xiong, Quanquan Gu, and Tong Zhang. Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and markov decision processes. In International Conference on Machine Learning, pages 39834–39863. PMLR, 2023. 21

  44. [44]

    Corruption-robust offline reinforcement learning

    Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 5757–5773. PMLR, 2022

  45. [45]

    Corruption-robust offline reinforcement learning with general function approximation.Advances in Neural Information Processing Systems, 36:36208–36221, 2023

    Chenlu Ye, Rui Yang, Quanquan Gu, and Tong Zhang. Corruption-robust offline reinforcement learning with general function approximation.Advances in Neural Information Processing Systems, 36:36208–36221, 2023

  46. [46]

    Robust Q-learning under corrupted rewards

    Sreejeet Maity and Aritra Mitra. Robust Q-learning under corrupted rewards. In2024 IEEE 63rd Conference on Decision and Control (CDC), pages 1181–1186. IEEE, 2024

  47. [47]

    Robust offline reinforcement learning with heavy-tailed rewards

    Jin Zhu, Runzhe Wan, Zhengling Qi, Shikai Luo, and Chengchun Shi. Robust offline reinforcement learning with heavy-tailed rewards. InInternational Conference on Artificial Intelligence and Statistics, pages 541–549. PMLR, 2024

  48. [48]

    No-regret reinforcement learning with heavy-tailed rewards

    Vincent Zhuang and Yanan Sui. No-regret reinforcement learning with heavy-tailed rewards. InInternational Conference on Artificial Intelligence and Statistics, pages 3385–3393. PMLR, 2021

  49. [49]

    Provably robust temporal difference learning for heavy-tailed rewards.Advances in Neural Information Processing Systems, 36, 2024

    Semih Cayci and Atilla Eryilmaz. Provably robust temporal difference learning for heavy-tailed rewards.Advances in Neural Information Processing Systems, 36, 2024

  50. [50]

    A finite time analysis of temporal difference learning with linear function approximation

    Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR, 2018

  51. [51]

    Adversarially-robust TD learning with Markovian data: Finite-time rates and fundamental limits.arXiv preprint arXiv:2502.04662, 2025

    Sreejeet Maity and Aritra Mitra. Adversarially-robust TD learning with Markovian data: Finite-time rates and fundamental limits.arXiv preprint arXiv:2502.04662, 2025

  52. [52]

    Asynchronous stochastic approximation and Q-learning.Machine learning, 16:185–202, 1994

    John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16:185–202, 1994

  53. [53]

    Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

    Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

  54. [54]

    MIT press, 2018

    Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018

  55. [55]

    An analysis of temporal-difference learning with function approximation

    John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. InIEEE Transactions on Automatic Control, 1997

  56. [56]

    Mean estimation and regression under heavy-tailed distributions: A survey.Found

    Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey.Found. Comput. Math., 19(5):1145–1190, October 2019

  57. [57]

    Learning from untrusted data

    Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pages 47–60, 2017

  58. [58]

    Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

    Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpℓ∞- bounds forQ-learning.arXiv preprint arXiv:1905.06265, 2019

  59. [59]

    Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998

    Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998. 22

  60. [60]

    Learning rates for Q-learning.Journal of machine learning Research, 5(1), 2003

    Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for Q-learning.Journal of machine learning Research, 5(1), 2003

  61. [61]

    Near-optimal time and sample complexities for solving Markov decision processes with a generative model.Advances in Neural Information Processing Systems, 31, 2018

    Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model.Advances in Neural Information Processing Systems, 31, 2018

  62. [62]

    Robust Covariance and Scatter Matrix Estimation under Huber's Contamination Model

    Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance matrix estimation via matrix depth.arXiv preprint arXiv:1506.00691, 2015

  63. [63]

    Agnostic estimation of mean and covariance

    Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016

  64. [64]

    High-dimensional robust mean estimation in nearly-linear time

    Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. InProc. of the thirtieth annual ACM-SIAM symp. on discrete algorithms, pages 2755–2771. SIAM, 2019

  65. [65]

    Dalalyan and Arshak Minasyan

    Arnak S. Dalalyan and Arshak Minasyan. All-in-one robust estimator of the gaussian mean. The Annals of Statistics, 2022

  66. [66]

    Performance of Q-learning with linear function approximation: Stability and finite-time analysis

    Zaiwei Chen, Sheng Zhang, Thinh T Doan, Siva Theja Maguluri, and John-Paul Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, page 4, 2019

  67. [67]

    Finite-time error bounds for linear stochastic approximation and TD learning

    Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. InConference on Learning Theory, pages 2803–2830. PMLR, 2019

  68. [68]

    Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

    Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

  69. [69]

    Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

    Gandharv Patil, LA Prashanth, Dheeraj Nagaraj, and Doina Precup. Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. InInternational Conference on Artificial Intelligence and Statistics, pages 5438–5448. PMLR, 2023

  70. [70]

    A simple finite-time analysis of TD learning with linear function approximation

    Aritra Mitra. A simple finite-time analysis of TD learning with linear function approximation. IEEE Transactions on Automatic Control, 70(2):1388–1394, 2024

  71. [71]

    Estimating the mixing time of ergodic markov chains

    Geoffrey Wolfer and Aryeh Kontorovich. Estimating the mixing time of ergodic markov chains. InConference on Learning Theory, pages 3120–3159. PMLR, 2019

  72. [72]

    Mixing time estimation in reversible markov chains from a single sample path.Advances in neural information processing systems, 28, 2015

    Daniel J Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible markov chains from a single sample path.Advances in neural information processing systems, 28, 2015

  73. [73]

    Robust estimation of discrete distributions under local differential privacy

    Julien Chhor and Flore Sentenac. Robust estimation of discrete distributions under local differential privacy. InInternational Conference on Algorithmic Learning Theory, pages 411–446. PMLR, 2023

  74. [74]

    Robust online and distributed mean estimation under adversarial data corruption

    Tong Yao and Shreyas Sundaram. Robust online and distributed mean estimation under adversarial data corruption. In2022 IEEE 61st Conference on Decision and Control (CDC), pages 4193–4198. IEEE, 2022. 23

  75. [75]

    Outlier-robust linear system identification under heavy-tailed noise.arXiv preprint arXiv:2501.00421, 2024

    Vinay Kanakeri and Aritra Mitra. Outlier-robust linear system identification under heavy-tailed noise.arXiv preprint arXiv:2501.00421, 2024

  76. [76]

    Boosting-enabled robust system identification of partially observed lti systems under heavy-tailed noise.arXiv preprint arXiv:2504.18444, 2025

    Vinay Kanakeri and Aritra Mitra. Boosting-enabled robust system identification of partially observed lti systems under heavy-tailed noise.arXiv preprint arXiv:2504.18444, 2025

  77. [77]

    Springer, 2009

    Vivek S Borkar.Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009

  78. [78]

    The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

    Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

  79. [79]

    Q-learning with nearest neighbors.Advances in Neural Information Processing Systems, 31, 2018

    Devavrat Shah and Qiaomin Xie. Q-learning with nearest neighbors.Advances in Neural Information Processing Systems, 31, 2018

  80. [80]

    Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling

    LA Prashanth, Nathaniel Korda, and Rémi Munos. Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling. Machine Learning, 110(3):559–618, 2021

Showing first 80 references.