Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates
Pith reviewed 2026-05-22 12:55 UTC · model grok-4.3
The pith
Despite adversarial reward corruption, a robust asynchronous Q-learning algorithm achieves finite-time convergence rates matching standard bounds up to an additive term that scales with the corruption fraction, and these rates are shown to
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A novel robust Q-learning algorithm attains finite-time convergence to the optimal action-value function at rates that remain near-optimal under adversarial reward corruptions, matching existing asynchronous Q-learning bounds up to an additive term proportional to the corruption fraction; an information-theoretic lower bound confirms that no algorithm can do substantially better, and the guarantees hold without assumptions on the reward distribution.
What carries the argument
A refined Azuma-Hoeffding inequality for almost-martingales that controls concentration of the time-correlated, partially corrupted samples arising in asynchronous Q-learning.
If this is right
- The algorithm remains agnostic to the reward distribution while still delivering the stated guarantees.
- The same finite-time robustness extends to the asynchronous sampling model with temporal correlations.
- Matching lower bounds establish that the derived rates cannot be improved by more than constants.
- The refined concentration tool may be reusable for other RL algorithms that face similar sampling and corruption issues.
Where Pith is reading between the lines
- Similar robustification techniques could be applied to other value-based methods such as SARSA or fitted Q-iteration.
- In deployed systems the results suggest that modest corruption fractions incur only linear rather than catastrophic degradation.
- The approach may inform the design of RL algorithms that must operate with noisy or manipulated feedback from human users or sensors.
Load-bearing premise
A fixed but unknown fraction of the observed rewards may be arbitrarily corrupted by an adversary, and the almost-martingale structure induced by asynchronous sampling must satisfy the refined concentration inequality used in the analysis.
What would settle it
An experiment that measures the sample complexity needed to reach a given accuracy level while corrupting a known fraction epsilon of rewards and checks whether the observed complexity exceeds the standard bound by more than a constant times epsilon.
Figures
read the original abstract
We study the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the \(Q\)-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous \(Q\)-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a robust variant of asynchronous Q-learning for discounted infinite-horizon MDPs subject to adversarial reward corruptions affecting a bounded fraction of samples. It claims finite-time convergence guarantees that recover the best known rates for the uncorrupted asynchronous case up to an additive term linear in the corruption fraction, establishes a matching information-theoretic lower bound, and introduces a refined Azuma-Hoeffding inequality for almost-martingales as the key technical tool. The algorithm is distribution-agnostic and provides the first such finite-time robustness results under asynchronous sampling.
Significance. If the central bounds hold, the work supplies the first finite-time robustness guarantees for asynchronous Q-learning, a setting that is practically important because real-world data collection is often asynchronous and time-correlated. The information-theoretic lower bound and the distribution-agnostic property strengthen the contribution. The refined concentration inequality for almost-martingales is presented as potentially reusable and is therefore a positive technical feature.
major comments (1)
- [Section 4] Section 4 (main analysis) and the statement of the refined Azuma-Hoeffding inequality: the proof must explicitly verify that the almost-martingale property and bounded-difference conditions continue to hold when the asynchronous sampling process (which already induces temporal correlations) is combined with adversarial corruptions on a fraction of the observed rewards. The additive corruption term in the sample-complexity bound rests on this joint applicability; without a self-contained argument showing that corruptions do not introduce uncontrolled deviations outside the almost-martingale framework, the claimed near-optimality is not yet justified.
minor comments (2)
- [Abstract] The abstract and introduction could state the precise dependence of the additive term on the corruption fraction (e.g., whether it is linear in the fraction or involves additional logarithmic factors) to make the near-optimality claim immediately quantifiable.
- [Section 2] Notation for the corruption model (e.g., the indicator for corrupted samples and the bound on the corruption magnitude) should be introduced once in a dedicated preliminary subsection rather than inline in the analysis.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The feedback on the technical details of the analysis in Section 4 is appreciated, and we address the major comment below.
read point-by-point responses
-
Referee: [Section 4] Section 4 (main analysis) and the statement of the refined Azuma-Hoeffding inequality: the proof must explicitly verify that the almost-martingale property and bounded-difference conditions continue to hold when the asynchronous sampling process (which already induces temporal correlations) is combined with adversarial corruptions on a fraction of the observed rewards. The additive corruption term in the sample-complexity bound rests on this joint applicability; without a self-contained argument showing that corruptions do not introduce uncontrolled deviations outside the almost-martingale framework, the claimed near-optimality is not yet justified.
Authors: We thank the referee for pointing out the need for greater explicitness on this technical point. In the current manuscript, the refined Azuma-Hoeffding inequality is applied to the almost-martingale difference sequence generated by the stochastic approximation errors under asynchronous sampling; the adversarial corruptions are isolated as a separate additive perturbation whose total contribution is bounded by the corruption fraction times the reward range. Because the corruptions affect only a bounded fraction of samples and do not alter the conditional-expectation structure of the clean samples with respect to the filtration induced by the asynchronous process, the almost-martingale and bounded-difference properties are preserved for the noise term. Nevertheless, we agree that a self-contained verification of this joint applicability would strengthen the presentation. In the revised version we will insert a new supporting lemma immediately preceding the main convergence argument in Section 4. The lemma will (i) recall the definition of the almost-martingale under asynchronous sampling, (ii) show that the corruption indicator process contributes only a deterministic bias that is absorbed into the additive corruption term of the final bound, and (iii) verify that the bounded-difference condition continues to hold uniformly over the combined process. This addition will make the justification of the near-optimal rate fully explicit without changing any of the stated theorems. revision: yes
Circularity Check
Derivation relies on external concentration inequality and independent lower bound; no reduction to fitted parameters or self-referential definitions.
full rationale
The paper introduces a robust Q-learning variant and derives finite-time bounds by applying a refined Azuma-Hoeffding inequality to almost-martingales under asynchronous sampling and limited adversarial corruption. It separately establishes an information-theoretic lower bound to show near-optimality. No step equates a claimed prediction or rate to a quantity defined by the algorithm's own fitted parameters, nor does any central claim reduce to a self-citation chain that itself assumes the target result. The analysis is presented as self-contained against standard RL benchmarks, with the inequality invoked as a technical tool rather than derived from the paper's outputs. This yields a normal non-finding of circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Discounted infinite-horizon MDP with asynchronous sampling and time-correlated observations
- domain assumption Adversarial reward corruption affecting at most a known fraction of samples
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales... (Theorem 5, Appendix G)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
finite-time convergence rate... ˜O(1/√T) up to additive O(√ε) term (Theorems 2,4,6)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Yahui Bai, Yuhe Gao, Runzhe Wan, Sheng Zhang, and Rui Song. A review of reinforcement learning in financial applications.Annual Review of Statistics and Its Application, 12(1):209–232, 2025
work page 2025
-
[2]
K Banumathi, Latha Venkatesan, Lizy Sonia Benjamin, K Vijayalakshmi, Nesa Sathya Satchi, and Nesa Sathya Satchi IV. Reinforcement learning in personalized medicine: A comprehensive review of treatment optimization strategies.Cureus, 17(4), 2025
work page 2025
-
[3]
Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022
M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022
work page 2022
-
[4]
Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving
Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving.arXiv preprint arXiv:1610.03295, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[5]
Reinforcement learning in robotics: A survey
Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013
work page 2013
-
[6]
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.The Journal of Machine Learning Research, 17(1):1334–1373, 2016
work page 2016
-
[7]
Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, and R Srikant. Exploration-driven policy optimization in rlhf: Theoretical insights on efficient data utilization.arXiv preprint arXiv:2402.10342, 2024
-
[8]
Debmalya Mandal, Andi Nika, Parameswaran Kamalaruban, Adish Singla, and Goran Radanović. Corruption robust offline reinforcement learning with human feedback.arXiv preprint arXiv:2402.06734, 2024
-
[9]
Provably robust dpo: Aligning language models with noisy feed- back,
Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natarajan. Provably robust dpo: Aligning language models with noisy feedback.arXiv preprint arXiv:2403.00409, 2024
-
[10]
Learning product rankings robust to fake users
Negin Golrezaei, Vahideh Manshadi, Jon Schneider, and Shreyas Sekar. Learning product rankings robust to fake users. InProceedings of the 22nd ACM Conference on Economics and Computation, pages 560–561, 2021
work page 2021
-
[11]
Identifying and treating outliers in finance.Financial Management, 48(2):345–384, 2019
John Adams, Darren Hayunga, Sattar Mansi, David Reeb, and Vincenzo Verardi. Identifying and treating outliers in finance.Financial Management, 48(2):345–384, 2019. 19
work page 2019
-
[12]
A systems and control perspective of CPS security.Annual reviews in control, 47:394–411, 2019
Seyed Mehran Dibaji, Mohammad Pirani, David Bezalel Flamholz, Anuradha M Annaswamy, Karl Henrik Johansson, and Aranya Chakrabortty. A systems and control perspective of CPS security.Annual reviews in control, 47:394–411, 2019
work page 2019
-
[13]
Q-learning.Machine learning, 8:279–292, 1992
Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8:279–292, 1992
work page 1992
-
[14]
Robust estimation of a location parameter
Peter J Huber. Robust estimation of a location parameter. InBreakthroughs in statistics, pages 492–518. Springer, 1992
work page 1992
-
[15]
Peter J Huber.Robust statistics, volume 523. John Wiley & Sons, 2004
work page 2004
-
[16]
Adversarial Attacks on Stochastic Bandits
Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Xiaojin Zhu. Adversarial attacks on stochastic bandits.arXiv preprint arXiv:1810.12188, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[17]
Stochastic bandits robust to adversarial corruptions
Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. InProceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122, 2018
work page 2018
-
[18]
Data poisoning attacks on stochastic bandits
Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. InInternational Conference on Machine Learning, pages 4042–4050. PMLR, 2019
work page 2019
-
[19]
Better algorithms for stochastic bandits with adversarial corruptions
Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. InConference on Learning Theory, pages 1562–1578. PMLR, 2019
work page 2019
-
[20]
Corruption-tolerant bandit learning.Machine Learning, 108(4):687–715, 2019
Sayash Kapoor, Kumar Kshitij Patel, and Purushottam Kar. Corruption-tolerant bandit learning.Machine Learning, 108(4):687–715, 2019
work page 2019
-
[21]
Crimed: Lower and upper bounds on regret for bandits with unbounded stochastic corruption
Shubhada Agrawal, Timothée Mathieu, Debabrota Basu, and Odalric-Ambrym Maillard. Crimed: Lower and upper bounds on regret for bandits with unbounded stochastic corruption. In International Conference on Algorithmic Learning Theory, pages 74–124. PMLR, 2024
work page 2024
-
[22]
Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988
Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988
work page 1988
-
[23]
Gabor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021
work page 2021
-
[24]
Nathaniel Korda and Prashanth La. On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. InInternational conference on machine learning, pages 626–634. PMLR, 2015
work page 2015
-
[25]
Finite sample analyses for TD (0) with function approximation
Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for TD (0) with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018
work page 2018
-
[26]
C Narayanan and Csaba Szepesvári. Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results. Technical report, Technical report, 2017
work page 2017
-
[27]
Linear Stochastic Approximation: Constant Step-Size and Iterate Averaging
Chandrashekar Lakshminarayanan and Csaba Szepesvári. Linear stochastic approximation: Constant step-size and iterate averaging.arXiv preprint arXiv:1709.04073, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[28]
Cambridge university press, 2019
Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. 20
work page 2019
-
[29]
Adam Wierman Guannan Qu. Finite-time analysis of asynchronous stochastic approximation and Q-learning.Proceedings of Machine Learning Research, 125:1–21, 2020
work page 2020
-
[30]
Gen Li, Changxiao Cai, Yuxin Chen, Yuting Wei, and Yuejie Chi. Is Q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024
work page 2024
-
[31]
Concentration inequalities and martingale inequalities: a survey
Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet mathematics, 3(1):79–127, 2006
work page 2006
-
[32]
A Variant of Azuma's Inequality for Martingales with Subgaussian Tails
Ohad Shamir. A variant of Azuma’s inequality for martingales with subgaussian tails.arXiv preprint arXiv:1110.2392, 2011
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[33]
Eli Shamir and Joel Spencer. Sharp concentration of the chromatic number on random graphs gn,p.Combinatorica, 7(1):121–129, Mar 1987
work page 1987
-
[34]
Adapting to mixing time in stochastic optimization with Markovian data
Ron Dorfman and Kfir Yehuda Levy. Adapting to mixing time in stochastic optimization with Markovian data. InInternational Conference on Machine Learning, pages 5429–5446. PMLR, 2022
work page 2022
-
[35]
Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with Markovian data: Fundamental limits and algorithms.Advances in neural information processing systems, 33:16666–16676, 2020
work page 2020
-
[36]
Stochastic linear bandits robust to adversarial attacks
Ilija Bogunovic, Arpan Losalka, Andreas Krause, and Jonathan Scarlett. Stochastic linear bandits robust to adversarial attacks. InInternational Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR, 2021
work page 2021
-
[37]
Corruption-tolerant gaussian process bandit optimization
Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant gaussian process bandit optimization. InInternational Conference on Artificial Intelligence and Statistics, pages 1071–1081. PMLR, 2020
work page 2020
-
[38]
Adversarial attacks on linear contextual bandits.arXiv preprint arXiv:2002.03839, 2020
Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits.arXiv preprint arXiv:2002.03839, 2020
-
[39]
Jiafan He, Dongruo Zhou, Tong Zhang, and Quanquan Gu. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions.arXiv preprint arXiv:2205.06811, 2022
-
[40]
Corruption-robust exploration in episodic reinforcement learning
Thodoris Lykouris, Max Simchowitz, Alex Slivkins, and Wen Sun. Corruption-robust exploration in episodic reinforcement learning. InConference on Learning Theory, pages 3242–3245. PMLR, 2021
work page 2021
-
[41]
Improved corruption robust algorithms for episodic reinforcement learning
Yifang Chen, Simon Du, and Kevin Jamieson. Improved corruption robust algorithms for episodic reinforcement learning. InInternational Conference on Machine Learning, pages 1561–1570. PMLR, 2021
work page 2021
-
[42]
A model selection approach for corruption robust reinforcement learning
Chen-Yu Wei, Christoph Dann, and Julian Zimmert. A model selection approach for corruption robust reinforcement learning. InInternational Conference on Algorithmic Learning Theory, pages 1043–1096. PMLR, 2022
work page 2022
-
[43]
Chenlu Ye, Wei Xiong, Quanquan Gu, and Tong Zhang. Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and markov decision processes. In International Conference on Machine Learning, pages 39834–39863. PMLR, 2023. 21
work page 2023
-
[44]
Corruption-robust offline reinforcement learning
Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 5757–5773. PMLR, 2022
work page 2022
-
[45]
Chenlu Ye, Rui Yang, Quanquan Gu, and Tong Zhang. Corruption-robust offline reinforcement learning with general function approximation.Advances in Neural Information Processing Systems, 36:36208–36221, 2023
work page 2023
-
[46]
Robust Q-learning under corrupted rewards
Sreejeet Maity and Aritra Mitra. Robust Q-learning under corrupted rewards. In2024 IEEE 63rd Conference on Decision and Control (CDC), pages 1181–1186. IEEE, 2024
work page 2024
-
[47]
Robust offline reinforcement learning with heavy-tailed rewards
Jin Zhu, Runzhe Wan, Zhengling Qi, Shikai Luo, and Chengchun Shi. Robust offline reinforcement learning with heavy-tailed rewards. InInternational Conference on Artificial Intelligence and Statistics, pages 541–549. PMLR, 2024
work page 2024
-
[48]
No-regret reinforcement learning with heavy-tailed rewards
Vincent Zhuang and Yanan Sui. No-regret reinforcement learning with heavy-tailed rewards. InInternational Conference on Artificial Intelligence and Statistics, pages 3385–3393. PMLR, 2021
work page 2021
-
[49]
Semih Cayci and Atilla Eryilmaz. Provably robust temporal difference learning for heavy-tailed rewards.Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[50]
A finite time analysis of temporal difference learning with linear function approximation
Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR, 2018
work page 2018
-
[51]
Sreejeet Maity and Aritra Mitra. Adversarially-robust TD learning with Markovian data: Finite-time rates and fundamental limits.arXiv preprint arXiv:2502.04662, 2025
-
[52]
Asynchronous stochastic approximation and Q-learning.Machine learning, 16:185–202, 1994
John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16:185–202, 1994
work page 1994
-
[53]
Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993
work page 1993
-
[54]
Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018
work page 2018
-
[55]
An analysis of temporal-difference learning with function approximation
John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. InIEEE Transactions on Automatic Control, 1997
work page 1997
-
[56]
Mean estimation and regression under heavy-tailed distributions: A survey.Found
Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey.Found. Comput. Math., 19(5):1145–1190, October 2019
work page 2019
-
[57]
Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pages 47–60, 2017
work page 2017
-
[58]
Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpℓ∞- bounds forQ-learning.arXiv preprint arXiv:1905.06265, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[59]
Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998. 22
work page 1998
-
[60]
Learning rates for Q-learning.Journal of machine learning Research, 5(1), 2003
Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for Q-learning.Journal of machine learning Research, 5(1), 2003
work page 2003
-
[61]
Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model.Advances in Neural Information Processing Systems, 31, 2018
work page 2018
-
[62]
Robust Covariance and Scatter Matrix Estimation under Huber's Contamination Model
Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance matrix estimation via matrix depth.arXiv preprint arXiv:1506.00691, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[63]
Agnostic estimation of mean and covariance
Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016
work page 2016
-
[64]
High-dimensional robust mean estimation in nearly-linear time
Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. InProc. of the thirtieth annual ACM-SIAM symp. on discrete algorithms, pages 2755–2771. SIAM, 2019
work page 2019
-
[65]
Arnak S. Dalalyan and Arshak Minasyan. All-in-one robust estimator of the gaussian mean. The Annals of Statistics, 2022
work page 2022
-
[66]
Performance of Q-learning with linear function approximation: Stability and finite-time analysis
Zaiwei Chen, Sheng Zhang, Thinh T Doan, Siva Theja Maguluri, and John-Paul Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, page 4, 2019
-
[67]
Finite-time error bounds for linear stochastic approximation and TD learning
Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. InConference on Learning Theory, pages 2803–2830. PMLR, 2019
work page 2019
-
[68]
Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022
work page 2022
-
[69]
Gandharv Patil, LA Prashanth, Dheeraj Nagaraj, and Doina Precup. Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. InInternational Conference on Artificial Intelligence and Statistics, pages 5438–5448. PMLR, 2023
work page 2023
-
[70]
A simple finite-time analysis of TD learning with linear function approximation
Aritra Mitra. A simple finite-time analysis of TD learning with linear function approximation. IEEE Transactions on Automatic Control, 70(2):1388–1394, 2024
work page 2024
-
[71]
Estimating the mixing time of ergodic markov chains
Geoffrey Wolfer and Aryeh Kontorovich. Estimating the mixing time of ergodic markov chains. InConference on Learning Theory, pages 3120–3159. PMLR, 2019
work page 2019
-
[72]
Daniel J Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible markov chains from a single sample path.Advances in neural information processing systems, 28, 2015
work page 2015
-
[73]
Robust estimation of discrete distributions under local differential privacy
Julien Chhor and Flore Sentenac. Robust estimation of discrete distributions under local differential privacy. InInternational Conference on Algorithmic Learning Theory, pages 411–446. PMLR, 2023
work page 2023
-
[74]
Robust online and distributed mean estimation under adversarial data corruption
Tong Yao and Shreyas Sundaram. Robust online and distributed mean estimation under adversarial data corruption. In2022 IEEE 61st Conference on Decision and Control (CDC), pages 4193–4198. IEEE, 2022. 23
work page 2022
-
[75]
Vinay Kanakeri and Aritra Mitra. Outlier-robust linear system identification under heavy-tailed noise.arXiv preprint arXiv:2501.00421, 2024
-
[76]
Vinay Kanakeri and Aritra Mitra. Boosting-enabled robust system identification of partially observed lti systems under heavy-tailed noise.arXiv preprint arXiv:2504.18444, 2025
-
[77]
Vivek S Borkar.Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009
work page 2009
-
[78]
Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000
work page 2000
-
[79]
Q-learning with nearest neighbors.Advances in Neural Information Processing Systems, 31, 2018
Devavrat Shah and Qiaomin Xie. Q-learning with nearest neighbors.Advances in Neural Information Processing Systems, 31, 2018
work page 2018
-
[80]
LA Prashanth, Nathaniel Korda, and Rémi Munos. Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling. Machine Learning, 110(3):559–618, 2021
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.