On the Sample Complexity of Differentially Private Policy Optimization

Xingyu Zhou; Yi He

arxiv: 2510.21060 · v3 · pith:LUJ2KHPMnew · submitted 2025-10-24 · 💻 cs.LG · cs.AI

On the Sample Complexity of Differentially Private Policy Optimization

Yi He , Xingyu Zhou This is my paper

Pith reviewed 2026-05-18 04:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords differential privacypolicy optimizationreinforcement learningsample complexitypolicy gradientnatural policy gradient

0 comments

The pith

Differentially private policy optimization adds privacy costs only as lower-order terms in sample complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a definition of differential privacy suited to the on-policy nature of policy optimization in reinforcement learning. Using this definition, it provides sample complexity analyses for standard algorithms including policy gradient and natural policy gradient under a unified framework. The results show that enforcing privacy often increases sample needs only by lower-order terms. Readers care because these algorithms run in sensitive domains like healthcare and robotics where data privacy must be balanced against performance.

Core claim

The authors formalize a differential privacy definition for policy optimization that addresses on-policy learning dynamics and the choice of privacy unit. They analyze the sample complexity of policy gradient, natural policy gradient, and related algorithms under differential privacy constraints via a unified framework. The central result is that privacy costs typically manifest as lower-order terms in the sample complexity, accompanied by several subtle observations particular to private policy optimization settings.

What carries the argument

A unified analytical framework for deriving sample complexity bounds of differentially private policy optimization algorithms.

Load-bearing premise

The analysis depends on the newly formalized definition of differential privacy for on-policy learning dynamics and the specific choice of privacy unit.

What would settle it

An experiment or calculation showing that the additional samples required for a fixed privacy level grow linearly with the number of iterations or episodes, rather than as a lower-order additive or multiplicative term, would falsify the main claim.

read the original abstract

Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper launches the first unified sample-complexity analysis for differentially private policy optimization but builds everything on a custom DP definition whose fit with standard notions is still untested.

read the letter

This paper is the first to give a systematic sample-complexity treatment of differentially private policy optimization. It formalizes a DP notion that accounts for on-policy trajectories and the choice of privacy unit, then derives bounds for policy gradient, natural policy gradient, and variants under one framework. The central observation is that privacy costs often show up only as lower-order terms once that definition is fixed.

Referee Report

2 major / 2 minor

Summary. The paper initiates a theoretical study of differentially private policy optimization in reinforcement learning. It formalizes a DP definition tailored to on-policy learning dynamics and the choice of privacy unit, then analyzes the sample complexity of policy gradient (PG), natural policy gradient (NPG), and variants under DP constraints via a unified framework. The central claim is that privacy costs often manifest as lower-order terms in the sample complexity, with additional subtle observations in private PO settings.

Significance. If the results hold, the work provides practical insights for privacy-preserving PO in sensitive domains like healthcare and robotics by suggesting that DP can often be added with only lower-order sample overhead. The unified analysis across PG/NPG and the focus on on-policy dynamics are strengths; the paper merits credit for attempting a systematic treatment of privacy unit choice in trajectory-based learning.

major comments (2)

[§3 (DP definition for PO)] §3 (DP definition for PO): The custom DP notion, which redefines the privacy unit to account for on-policy trajectory dependence, is load-bearing for the entire sample-complexity analysis. The manuscript provides no formal reduction to standard (ε,δ)-DP, no composition theorem over iterative policy updates, and no verification that the neighboring relation prevents leakage during gradient estimation or policy improvement steps. Without this, the separation of privacy costs as lower-order terms does not necessarily follow.
[§4 (unified sample-complexity bounds)] §4 (unified sample-complexity bounds): The claim that privacy overhead is lower-order for both PG and NPG relies on the fixed custom DP definition and specific assumptions on how privacy noise interacts with on-policy sampling. The presentation does not explicitly derive or bound the interaction term between the privacy mechanism and the policy-update operator, leaving open whether the lower-order result holds under standard MDP assumptions or requires additional restrictions on the policy class or horizon.

minor comments (2)

[Abstract] The abstract refers to 'and more' algorithms without listing the variants analyzed; this should be made precise in the introduction or contributions paragraph.
[Notation and Preliminaries] Notation for privacy parameters (ε, δ) and sample terms should be introduced once and used consistently; occasional redefinition across sections reduces readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help strengthen the rigor of our analysis. We appreciate the positive assessment of the paper's contributions in formalizing DP for policy optimization and analyzing sample complexity. We address each major comment below and will incorporate the suggested clarifications and additions in the revised manuscript.

read point-by-point responses

Referee: §3 (DP definition for PO): The custom DP notion, which redefines the privacy unit to account for on-policy trajectory dependence, is load-bearing for the entire sample-complexity analysis. The manuscript provides no formal reduction to standard (ε,δ)-DP, no composition theorem over iterative policy updates, and no verification that the neighboring relation prevents leakage during gradient estimation or policy improvement steps. Without this, the separation of privacy costs as lower-order terms does not necessarily follow.

Authors: We agree that establishing a formal link to standard DP strengthens the foundation. In the revision, we will add an appendix providing a formal reduction from our tailored DP definition to standard (ε,δ)-DP with respect to neighboring trajectories. We will also derive a composition theorem for the sequence of policy updates and explicitly verify that the chosen neighboring relation bounds sensitivity to prevent leakage in both gradient estimation and policy improvement steps. These additions will confirm that the lower-order privacy cost separation holds under the defined notion. revision: yes
Referee: §4 (unified sample-complexity bounds): The claim that privacy overhead is lower-order for both PG and NPG relies on the fixed custom DP definition and specific assumptions on how privacy noise interacts with on-policy sampling. The presentation does not explicitly derive or bound the interaction term between the privacy mechanism and the policy-update operator, leaving open whether the lower-order result holds under standard MDP assumptions or requires additional restrictions on the policy class or horizon.

Authors: We acknowledge the value of making the interaction term explicit. In the revised version, we will expand the unified analysis in Section 4 to derive and bound the interaction between the privacy noise and the policy-update operator. We will show that the lower-order overhead holds under standard MDP assumptions (bounded rewards, finite horizon, Lipschitz policies) without further restrictions on the policy class, and we will clarify any horizon-dependent conditions in the statements of the theorems. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation self-contained from new DP definition and standard primitives

full rationale

The paper introduces a custom DP definition for on-policy PO and derives sample-complexity bounds for PG/NPG variants under that definition. No quoted step reduces a claimed result to a fitted parameter, self-citation chain, or input by construction. The central claim that privacy appears as a lower-order term follows directly from the analysis once the neighboring relation and privacy unit are fixed; this is a standard first-principles derivation rather than a renaming or self-referential fit. The work is self-contained against external DP and RL benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view prevents full enumeration; the central claims rest on a custom DP definition for on-policy settings and standard RL convergence assumptions that are not detailed here.

axioms (1)

domain assumption A suitable differential privacy definition exists for on-policy policy optimization that respects learning dynamics and privacy unit choice.
Stated in the abstract as the first step before sample-complexity analysis.

pith-pipeline@v0.9.0 · 5684 in / 1127 out tokens · 31162 ms · 2026-05-18T04:40:49.628759+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 3 internal anchors

[1]

Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8:229–256, 1992

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8:229–256, 1992

work page 1992
[2]

Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

work page 1999
[3]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

work page 2021
[6]

A general sample complexity analysis of vanilla policy gradient

Rui Yuan, Robert M Gower, and Alessandro Lazaric. A general sample complexity analysis of vanilla policy gradient. InInternational Conference on Artificial Intelligence and Statistics, pages 3332–3380. PMLR, 2022

work page 2022
[7]

Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps

Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. InProceedings of the AAAI Confer- ence on Artificial Intelligence, volume 34, pages 5668–5675, 2020

work page 2020
[8]

Neural trust region/proximal policy optimization attains globally optimal policy.Advances in neural information processing systems, 32, 2019

Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural trust region/proximal policy optimization attains globally optimal policy.Advances in neural information processing systems, 32, 2019

work page 2019
[9]

Calibrating noise to sensitivity in private data analysis

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. InTheory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer, 2006

work page 2006
[10]

A natural policy gradient.Advances in neural information processing systems, 14, 2001

Sham M Kakade. A natural policy gradient.Advances in neural information processing systems, 14, 2001

work page 2001
[11]

Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J

Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, and Wen Sun. Rebel: Reinforcement learning via regressing relative rewards, 2024. URL https://arxiv.org/abs/2404.16767

work page arXiv 2024
[12]

Differentially private policy gradient.arXiv preprint arXiv:2501.19080, 2025

Alexandre Rio, Merwan Barlier, and Igor Colin. Differentially private policy gradient.arXiv preprint arXiv:2501.19080, 2025

work page arXiv 2025
[13]

Differentially private regret minimization in episodic markov decision processes

Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6375–6383, 2022. 10

work page 2022
[14]

Differentially private reinforcement learning with linear function approximation

Xingyu Zhou. Differentially private reinforcement learning with linear function approximation. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(1):1–27, 2022

work page 2022
[15]

Differentially private policy evaluation

Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. In International Conference on Machine Learning, pages 2130–2138. PMLR, 2016

work page 2016
[16]

Privately aligning language models with reinforcement learning.arXiv preprint arXiv:2310.16960, 2023

Fan Wu, Huseyin A Inan, Arturs Backurs, Varun Chandrasekaran, Janardhan Kulkarni, and Robert Sim. Privately aligning language models with reinforcement learning.arXiv preprint arXiv:2310.16960, 2023

work page arXiv 2023
[17]

Regressing the relative future: Efficient policy optimization for multi-turn rlhf

Zhaolin Gao, Wenhao Zhan, Jonathan D Chang, Gokul Swamy, Kianté Brantley, Jason D Lee, and Wen Sun. Regressing the relative future: Efficient policy optimization for multi-turn rlhf. arXiv preprint arXiv:2410.04612, 2024

work page arXiv 2024
[18]

Spoiled for choice? personalized recommendation for healthcare decisions: A multiarmed bandit approach.Information Systems Research, 34(4):1493–1512, 2023

Tongxin Zhou, Yingfei Wang, Lu Yan, and Yong Tan. Spoiled for choice? personalized recommendation for healthcare decisions: A multiarmed bandit approach.Information Systems Research, 34(4):1493–1512, 2023

work page 2023
[19]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[20]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

work page 2023
[21]

Differentially private empirical risk minimization.Journal of Machine Learning Research, 12(3), 2011

Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization.Journal of Machine Learning Research, 12(3), 2011

work page 2011
[22]

Private empirical risk minimization: Efficient algorithms and tight error bounds

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In2014 IEEE 55th annual symposium on foundations of computer science, pages 464–473. IEEE, 2014

work page 2014
[23]

Private stochastic convex optimization with optimal rates.Advances in neural information processing systems, 32, 2019

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates.Advances in neural information processing systems, 32, 2019

work page 2019
[24]

Private reinforcement learning with pac and regret guarantees

Giuseppe Vietri, Borja Balle, Akshay Krishnamurthy, and Steven Wu. Private reinforcement learning with pac and regret guarantees. InInternational Conference on Machine Learning, pages 9754–9764. PMLR, 2020

work page 2020
[25]

Private matchings and allocations.SIAM Journal on Computing, 45:1953–1984, 2016

J Hsu, Z Huang, A Roth, T Roughgarden, and ZS Wu. Private matchings and allocations.SIAM Journal on Computing, 45:1953–1984, 2016

work page 1953
[26]

Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018

Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[27]

Near-optimal differentially private reinforcement learning

Dan Qiao and Yu-Xiang Wang. Near-optimal differentially private reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 9914–9940. PMLR, 2023

work page 2023
[28]

The algorithmic foundations of differential privacy.Founda- tions and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy.Founda- tions and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

work page 2014
[29]

On the global optimum convergence of momentum- based policy gradient

Yuhao Ding, Junzi Zhang, and Javad Lavaei. On the global optimum convergence of momentum- based policy gradient. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 ofProceedings of Machine Learning Research, pages 1910–1934. PMLR, 2...

work page 1910
[30]

An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods, 2022

Yanli Liu, Kaiqing Zhang, Tamer Ba¸ sar, and Wotao Yin. An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods, 2022. URL https://arxiv. org/abs/2211.07937. 11

work page arXiv 2022
[31]

Neural policy gradient methods: Global optimality and rates of convergence, 2019

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence, 2019. URL https://arxiv.org/abs/1909. 01150

work page 2019
[32]

Hybrid rl: Using both offline and online data can make rl efficient,

Yuda Song, Yifei Zhou, Ayush Sekhari, J Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid rl: Using both offline and online data can make rl efficient.arXiv preprint arXiv:2210.06718, 2022

work page arXiv 2022
[33]

Mechanism design via differential privacy

Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007

work page 2007
[34]

Insufficient statistics perturbation: Stable estimators for private least squares.arXiv preprint arXiv:2404.15409, 2024

Gavin Brown, Jonathan Hayase, Samuel Hopkins, Weihao Kong, Xiyang Liu, Sewoong Oh, Juan C Perdomo, and Adam Smith. Insufficient statistics perturbation: Stable estimators for private least squares.arXiv preprint arXiv:2404.15409, 2024

work page arXiv 2024
[35]

Near-optimal private learning in linear contextual bandits, 2025

Fan Chen, Jiachun Li, Alexander Rakhlin, and David Simchi-Levi. Near-optimal private learning in linear contextual bandits, 2025. URLhttps://arxiv.org/abs/2502.13115

work page arXiv 2025
[36]

Linear convergence of natural policy gradient methods with log-linear policies.arXiv preprint arXiv:2210.01400, 2022

Rui Yuan, Simon S Du, Robert M Gower, Alessandro Lazaric, and Lin Xiao. Linear convergence of natural policy gradient methods with log-linear policies.arXiv preprint arXiv:2210.01400, 2022

work page arXiv 2022
[37]

Stochastic variance-reduced policy gradient

Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. InInternational conference on machine learning, pages 4026–4035. PMLR, 2018

work page 2018
[38]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

work page 2015
[39]

Global optimality guarantees for policy gradient methods

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. Operations Research, 72(5):1906–1927, 2024

work page 1906
[40]

Neural proximal/trust region policy optimization attains globally optimal policy (2019).arXiv preprint arXiv:1906.10306, 1906

Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy (2019).arXiv preprint arXiv:1906.10306, 1906

work page arXiv 2019
[41]

An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

work page 2020
[42]

Optimistic policy optimization with bandit feedback

Lior Shani, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback. InInternational Conference on Machine Learning, pages 8604–8613. PMLR, 2020

work page 2020
[43]

Provably efficient exploration in policy optimization

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. InInternational Conference on Machine Learning, pages 1283–1294. PMLR, 2020

work page 2020
[44]

Pc-pg: Policy cover directed exploration for provable policy gradient learning.Advances in neural information processing systems, 33:13399–13412, 2020

Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning.Advances in neural information processing systems, 33:13399–13412, 2020

work page 2020
[45]

Cautiously optimistic policy optimiza- tion and exploration with linear function approximation

Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimiza- tion and exploration with linear function approximation. InConference on Learning Theory, pages 4473–4525. PMLR, 2021

work page 2021
[46]

Optimistic natural policy gradient: a simple efficient policy optimization framework for online rl.Advances in Neural Information Processing Systems, 36:3560–3577, 2023

Qinghua Liu, Gellért Weisz, András György, Chi Jin, and Csaba Szepesvári. Optimistic natural policy gradient: a simple efficient policy optimization framework for online rl.Advances in Neural Information Processing Systems, 36:3560–3577, 2023

work page 2023
[47]

(nearly) optimal differentially private stochastic multi-arm bandits

Nikita Mishra and Abhradeep Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601, 2015. 12

work page 2015
[48]

An optimal private stochastic-mab algorithm based on optimal private stopping rule

Touqir Sajed and Or Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. InInternational Conference on Machine Learning, pages 5579–5588. PMLR, 2019

work page 2019
[49]

Distributed differential privacy in multi-armed bandits.arXiv preprint arXiv:2206.05772, 2022

Sayak Ray Chowdhury and Xingyu Zhou. Distributed differential privacy in multi-armed bandits.arXiv preprint arXiv:2206.05772, 2022

work page arXiv 2022
[50]

Multi-armed bandits with local differential privacy.arXiv preprint arXiv:2007.03121, 2020

Wenbo Ren, Xingyu Zhou, Jia Liu, and Ness B Shroff. Multi-armed bandits with local differential privacy.arXiv preprint arXiv:2007.03121, 2020

work page arXiv 2007
[51]

On private and robust bandits.Advances in Neural Information Processing Systems, 36:34778–34790, 2023

Yulian Wu, Xingyu Zhou, Youming Tao, and Di Wang. On private and robust bandits.Advances in Neural Information Processing Systems, 36:34778–34790, 2023

work page 2023
[52]

A reduction from linear contextual bandits lower bounds to estimations lower bounds

Jiahao He, Jiheng Zhang, and Rachel Q Zhang. A reduction from linear contextual bandits lower bounds to estimations lower bounds. InInternational Conference on Machine Learning, pages 8660–8677. PMLR, 2022

work page 2022
[53]

Shuffle private linear contextual bandits.arXiv preprint arXiv:2202.05567, 2022

Sayak Ray Chowdhury and Xingyu Zhou. Shuffle private linear contextual bandits.arXiv preprint arXiv:2202.05567, 2022

work page arXiv 2022
[54]

On differentially private federated linear contextual bandits.arXiv preprint arXiv:2302.13945, 2023

Xingyu Zhou and Sayak Ray Chowdhury. On differentially private federated linear contextual bandits.arXiv preprint arXiv:2302.13945, 2023

work page arXiv 2023
[55]

Differentially private ex- ploration in reinforcement learning with linear representation.arXiv preprint arXiv:2112.01585, 2021

Paul Luyo, Evrard Garcelon, Alessandro Lazaric, and Matteo Pirotta. Differentially private ex- ploration in reinforcement learning with linear representation.arXiv preprint arXiv:2112.01585, 2021

work page arXiv 2021
[56]

Private stochastic convex optimization: optimal rates in linear time

Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. InProceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020

work page 2020
[57]

Private stochastic convex optimiza- tion: Optimal rates in l1 geometry

Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimiza- tion: Optimal rates in l1 geometry. InInternational Conference on Machine Learning, pages 393–403. PMLR, 2021

work page 2021
[58]

Non-euclidean differentially private stochastic convex optimization

Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. InConference on Learning Theory, pages 474–499. PMLR, 2021

work page 2021
[59]

Private non-smooth empirical risk minimiza- tion and stochastic convex optimization in subquadratic steps.arXiv preprint arXiv:2103.15352, 2021

Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimiza- tion and stochastic convex optimization in subquadratic steps.arXiv preprint arXiv:2103.15352, 2021

work page arXiv 2021
[60]

Private federated learning without a trusted server: Optimal algorithms for convex losses

Andrew Lowy and Meisam Razaviyayn. Private federated learning without a trusted server: Optimal algorithms for convex losses. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[61]

Changyu Gao, Andrew Lowy, Xingyu Zhou, and Stephen J Wright. Private heterogeneous federated learning without a trusted server revisited: Error-optimal and communication-efficient algorithms for convex losses.arXiv preprint arXiv:2407.09690, 2024

work page arXiv 2024
[62]

Differentially private empirical risk minimization with non-convex loss functions

Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. InInternational Conference on Machine Learning, pages 6526–6535. PMLR, 2019

work page 2019
[63]

Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds

Yingxue Zhou, Xiangyi Chen, Mingyi Hong, Zhiwei Steven Wu, and Arindam Banerjee. Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds. arXiv preprint arXiv:2006.13501, 2020

work page arXiv 2006
[64]

Faster rates of convergence to stationary points in differentially private optimization

Raman Arora, Raef Bassily, Tomás González, Cristóbal A Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. In International Conference on Machine Learning, pages 1060–1092. PMLR, 2023. 13

work page 2023
[65]

Private (stochastic) non-convex optimization revisited: Second-order stationary points and excess risks.Advances in Neural Information Processing Systems, 36:65618–65641, 2023

Daogao Liu, Arun Ganesh, Sewoong Oh, and Abhradeep Guha Thakurta. Private (stochastic) non-convex optimization revisited: Second-order stationary points and excess risks.Advances in Neural Information Processing Systems, 36:65618–65641, 2023

work page 2023
[66]

Adaptive batch size for privately finding second-order stationary points.arXiv preprint arXiv:2410.07502, 2024

Daogao Liu and Kunal Talwar. Adaptive batch size for privately finding second-order stationary points.arXiv preprint arXiv:2410.07502, 2024

work page arXiv 2024
[67]

Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain

Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain.arXiv preprint arXiv:1803.02596, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[68]

Old techniques in differentially private linear regression

Or Sheffet. Old techniques in differentially private linear regression. InAlgorithmic Learning Theory, pages 789–827. PMLR, 2019

work page 2019
[69]

The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy.The Annals of Statistics, 49(5):2825–2850, 2021

T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy.The Annals of Statistics, 49(5):2825–2850, 2021

work page 2021
[70]

Private gradient descent for linear regression: Tighter error bounds and instance-specific uncertainty estimation.arXiv preprint arXiv:2402.13531, 2024

Gavin Brown, Krishnamurthy Dvijotham, Georgina Evans, Daogao Liu, Adam Smith, and Abhradeep Thakurta. Private gradient descent for linear regression: Tighter error bounds and instance-specific uncertainty estimation.arXiv preprint arXiv:2402.13531, 2024

work page arXiv 2024
[71]

(nearly) optimal private linear regression for sub-gaussian data via adaptive clipping

Prateek Varshney, Abhradeep Thakurta, and Prateek Jain. (nearly) optimal private linear regression for sub-gaussian data via adaptive clipping. InConference on Learning Theory, pages 1126–1166. PMLR, 2022

work page 2022
[72]

Label robust and differentially private linear regression: Computational and statistical efficiency.Advances in Neural Information Processing Systems, 36:23019–23033, 2023

Xiyang Liu, Prateek Jain, Weihao Kong, Sewoong Oh, and Arun Suggala. Label robust and differentially private linear regression: Computational and statistical efficiency.Advances in Neural Information Processing Systems, 36:23019–23033, 2023

work page 2023
[73]

Sample-optimal private regression in polynomial time.arXiv preprint arXiv:2503.24321, 2025

Prashanti Anderson, Ainesh Bakshi, Mahbod Majid, and Stefan Tiegel. Sample-optimal private regression in polynomial time.arXiv preprint arXiv:2503.24321, 2025

work page arXiv 2025
[74]

Limitations

Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making.arXiv preprint arXiv:2112.13487, 2021. NeurIPS Paper Checklist 1.Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: : In the a...

work page arXiv 2021
[75]

e∇mJ(θ) + λ |X | 1 |Y| −[π x(θ)]x∈X 2# =E

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page 2025
[76]

who use PPO instead of NPG, and (iii) as privacy budget decreases (smaller ε), performance degrades as expected from theory. 36 AlgorithmεMean Final Reward Std Final Reward Best Epoch Mean PG N/A 334.37 25.25 361.70 DP-PG5 190.34 52.91 199.17 DP-PG3 143.87 22.88 187.17 NPG N/A 492.90 10.04 500.00 DP-NPG5 478.73 28.05 494.70 DP-NPG3 400.87 65.37 410.43 Tab...

work page

[1] [1]

Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8:229–256, 1992

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8:229–256, 1992

work page 1992

[2] [2]

Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

work page 1999

[3] [3]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[4] [4]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

work page 2021

[6] [6]

A general sample complexity analysis of vanilla policy gradient

Rui Yuan, Robert M Gower, and Alessandro Lazaric. A general sample complexity analysis of vanilla policy gradient. InInternational Conference on Artificial Intelligence and Statistics, pages 3332–3380. PMLR, 2022

work page 2022

[7] [7]

Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps

Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. InProceedings of the AAAI Confer- ence on Artificial Intelligence, volume 34, pages 5668–5675, 2020

work page 2020

[8] [8]

Neural trust region/proximal policy optimization attains globally optimal policy.Advances in neural information processing systems, 32, 2019

Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural trust region/proximal policy optimization attains globally optimal policy.Advances in neural information processing systems, 32, 2019

work page 2019

[9] [9]

Calibrating noise to sensitivity in private data analysis

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. InTheory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer, 2006

work page 2006

[10] [10]

A natural policy gradient.Advances in neural information processing systems, 14, 2001

Sham M Kakade. A natural policy gradient.Advances in neural information processing systems, 14, 2001

work page 2001

[11] [11]

Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J

Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, and Wen Sun. Rebel: Reinforcement learning via regressing relative rewards, 2024. URL https://arxiv.org/abs/2404.16767

work page arXiv 2024

[12] [12]

Differentially private policy gradient.arXiv preprint arXiv:2501.19080, 2025

Alexandre Rio, Merwan Barlier, and Igor Colin. Differentially private policy gradient.arXiv preprint arXiv:2501.19080, 2025

work page arXiv 2025

[13] [13]

Differentially private regret minimization in episodic markov decision processes

Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6375–6383, 2022. 10

work page 2022

[14] [14]

Differentially private reinforcement learning with linear function approximation

Xingyu Zhou. Differentially private reinforcement learning with linear function approximation. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(1):1–27, 2022

work page 2022

[15] [15]

Differentially private policy evaluation

Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. In International Conference on Machine Learning, pages 2130–2138. PMLR, 2016

work page 2016

[16] [16]

Privately aligning language models with reinforcement learning.arXiv preprint arXiv:2310.16960, 2023

Fan Wu, Huseyin A Inan, Arturs Backurs, Varun Chandrasekaran, Janardhan Kulkarni, and Robert Sim. Privately aligning language models with reinforcement learning.arXiv preprint arXiv:2310.16960, 2023

work page arXiv 2023

[17] [17]

Regressing the relative future: Efficient policy optimization for multi-turn rlhf

Zhaolin Gao, Wenhao Zhan, Jonathan D Chang, Gokul Swamy, Kianté Brantley, Jason D Lee, and Wen Sun. Regressing the relative future: Efficient policy optimization for multi-turn rlhf. arXiv preprint arXiv:2410.04612, 2024

work page arXiv 2024

[18] [18]

Spoiled for choice? personalized recommendation for healthcare decisions: A multiarmed bandit approach.Information Systems Research, 34(4):1493–1512, 2023

Tongxin Zhou, Yingfei Wang, Lu Yan, and Yong Tan. Spoiled for choice? personalized recommendation for healthcare decisions: A multiarmed bandit approach.Information Systems Research, 34(4):1493–1512, 2023

work page 2023

[19] [19]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022

[20] [20]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

work page 2023

[21] [21]

Differentially private empirical risk minimization.Journal of Machine Learning Research, 12(3), 2011

Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization.Journal of Machine Learning Research, 12(3), 2011

work page 2011

[22] [22]

Private empirical risk minimization: Efficient algorithms and tight error bounds

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In2014 IEEE 55th annual symposium on foundations of computer science, pages 464–473. IEEE, 2014

work page 2014

[23] [23]

Private stochastic convex optimization with optimal rates.Advances in neural information processing systems, 32, 2019

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates.Advances in neural information processing systems, 32, 2019

work page 2019

[24] [24]

Private reinforcement learning with pac and regret guarantees

Giuseppe Vietri, Borja Balle, Akshay Krishnamurthy, and Steven Wu. Private reinforcement learning with pac and regret guarantees. InInternational Conference on Machine Learning, pages 9754–9764. PMLR, 2020

work page 2020

[25] [25]

Private matchings and allocations.SIAM Journal on Computing, 45:1953–1984, 2016

J Hsu, Z Huang, A Roth, T Roughgarden, and ZS Wu. Private matchings and allocations.SIAM Journal on Computing, 45:1953–1984, 2016

work page 1953

[26] [26]

Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018

Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018

work page 2018

[27] [27]

Near-optimal differentially private reinforcement learning

Dan Qiao and Yu-Xiang Wang. Near-optimal differentially private reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 9914–9940. PMLR, 2023

work page 2023

[28] [28]

The algorithmic foundations of differential privacy.Founda- tions and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy.Founda- tions and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

work page 2014

[29] [29]

On the global optimum convergence of momentum- based policy gradient

Yuhao Ding, Junzi Zhang, and Javad Lavaei. On the global optimum convergence of momentum- based policy gradient. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 ofProceedings of Machine Learning Research, pages 1910–1934. PMLR, 2...

work page 1910

[30] [30]

An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods, 2022

Yanli Liu, Kaiqing Zhang, Tamer Ba¸ sar, and Wotao Yin. An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods, 2022. URL https://arxiv. org/abs/2211.07937. 11

work page arXiv 2022

[31] [31]

Neural policy gradient methods: Global optimality and rates of convergence, 2019

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence, 2019. URL https://arxiv.org/abs/1909. 01150

work page 2019

[32] [32]

Hybrid rl: Using both offline and online data can make rl efficient,

Yuda Song, Yifei Zhou, Ayush Sekhari, J Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid rl: Using both offline and online data can make rl efficient.arXiv preprint arXiv:2210.06718, 2022

work page arXiv 2022

[33] [33]

Mechanism design via differential privacy

Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007

work page 2007

[34] [34]

Insufficient statistics perturbation: Stable estimators for private least squares.arXiv preprint arXiv:2404.15409, 2024

Gavin Brown, Jonathan Hayase, Samuel Hopkins, Weihao Kong, Xiyang Liu, Sewoong Oh, Juan C Perdomo, and Adam Smith. Insufficient statistics perturbation: Stable estimators for private least squares.arXiv preprint arXiv:2404.15409, 2024

work page arXiv 2024

[35] [35]

Near-optimal private learning in linear contextual bandits, 2025

Fan Chen, Jiachun Li, Alexander Rakhlin, and David Simchi-Levi. Near-optimal private learning in linear contextual bandits, 2025. URLhttps://arxiv.org/abs/2502.13115

work page arXiv 2025

[36] [36]

Linear convergence of natural policy gradient methods with log-linear policies.arXiv preprint arXiv:2210.01400, 2022

Rui Yuan, Simon S Du, Robert M Gower, Alessandro Lazaric, and Lin Xiao. Linear convergence of natural policy gradient methods with log-linear policies.arXiv preprint arXiv:2210.01400, 2022

work page arXiv 2022

[37] [37]

Stochastic variance-reduced policy gradient

Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. InInternational conference on machine learning, pages 4026–4035. PMLR, 2018

work page 2018

[38] [38]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

work page 2015

[39] [39]

Global optimality guarantees for policy gradient methods

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. Operations Research, 72(5):1906–1927, 2024

work page 1906

[40] [40]

Neural proximal/trust region policy optimization attains globally optimal policy (2019).arXiv preprint arXiv:1906.10306, 1906

Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy (2019).arXiv preprint arXiv:1906.10306, 1906

work page arXiv 2019

[41] [41]

An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

work page 2020

[42] [42]

Optimistic policy optimization with bandit feedback

Lior Shani, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback. InInternational Conference on Machine Learning, pages 8604–8613. PMLR, 2020

work page 2020

[43] [43]

Provably efficient exploration in policy optimization

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. InInternational Conference on Machine Learning, pages 1283–1294. PMLR, 2020

work page 2020

[44] [44]

Pc-pg: Policy cover directed exploration for provable policy gradient learning.Advances in neural information processing systems, 33:13399–13412, 2020

Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning.Advances in neural information processing systems, 33:13399–13412, 2020

work page 2020

[45] [45]

Cautiously optimistic policy optimiza- tion and exploration with linear function approximation

Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimiza- tion and exploration with linear function approximation. InConference on Learning Theory, pages 4473–4525. PMLR, 2021

work page 2021

[46] [46]

Optimistic natural policy gradient: a simple efficient policy optimization framework for online rl.Advances in Neural Information Processing Systems, 36:3560–3577, 2023

Qinghua Liu, Gellért Weisz, András György, Chi Jin, and Csaba Szepesvári. Optimistic natural policy gradient: a simple efficient policy optimization framework for online rl.Advances in Neural Information Processing Systems, 36:3560–3577, 2023

work page 2023

[47] [47]

(nearly) optimal differentially private stochastic multi-arm bandits

Nikita Mishra and Abhradeep Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601, 2015. 12

work page 2015

[48] [48]

An optimal private stochastic-mab algorithm based on optimal private stopping rule

Touqir Sajed and Or Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. InInternational Conference on Machine Learning, pages 5579–5588. PMLR, 2019

work page 2019

[49] [49]

Distributed differential privacy in multi-armed bandits.arXiv preprint arXiv:2206.05772, 2022

Sayak Ray Chowdhury and Xingyu Zhou. Distributed differential privacy in multi-armed bandits.arXiv preprint arXiv:2206.05772, 2022

work page arXiv 2022

[50] [50]

Multi-armed bandits with local differential privacy.arXiv preprint arXiv:2007.03121, 2020

Wenbo Ren, Xingyu Zhou, Jia Liu, and Ness B Shroff. Multi-armed bandits with local differential privacy.arXiv preprint arXiv:2007.03121, 2020

work page arXiv 2007

[51] [51]

On private and robust bandits.Advances in Neural Information Processing Systems, 36:34778–34790, 2023

Yulian Wu, Xingyu Zhou, Youming Tao, and Di Wang. On private and robust bandits.Advances in Neural Information Processing Systems, 36:34778–34790, 2023

work page 2023

[52] [52]

A reduction from linear contextual bandits lower bounds to estimations lower bounds

Jiahao He, Jiheng Zhang, and Rachel Q Zhang. A reduction from linear contextual bandits lower bounds to estimations lower bounds. InInternational Conference on Machine Learning, pages 8660–8677. PMLR, 2022

work page 2022

[53] [53]

Shuffle private linear contextual bandits.arXiv preprint arXiv:2202.05567, 2022

Sayak Ray Chowdhury and Xingyu Zhou. Shuffle private linear contextual bandits.arXiv preprint arXiv:2202.05567, 2022

work page arXiv 2022

[54] [54]

On differentially private federated linear contextual bandits.arXiv preprint arXiv:2302.13945, 2023

Xingyu Zhou and Sayak Ray Chowdhury. On differentially private federated linear contextual bandits.arXiv preprint arXiv:2302.13945, 2023

work page arXiv 2023

[55] [55]

Differentially private ex- ploration in reinforcement learning with linear representation.arXiv preprint arXiv:2112.01585, 2021

Paul Luyo, Evrard Garcelon, Alessandro Lazaric, and Matteo Pirotta. Differentially private ex- ploration in reinforcement learning with linear representation.arXiv preprint arXiv:2112.01585, 2021

work page arXiv 2021

[56] [56]

Private stochastic convex optimization: optimal rates in linear time

Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. InProceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020

work page 2020

[57] [57]

Private stochastic convex optimiza- tion: Optimal rates in l1 geometry

Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimiza- tion: Optimal rates in l1 geometry. InInternational Conference on Machine Learning, pages 393–403. PMLR, 2021

work page 2021

[58] [58]

Non-euclidean differentially private stochastic convex optimization

Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. InConference on Learning Theory, pages 474–499. PMLR, 2021

work page 2021

[59] [59]

Private non-smooth empirical risk minimiza- tion and stochastic convex optimization in subquadratic steps.arXiv preprint arXiv:2103.15352, 2021

Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimiza- tion and stochastic convex optimization in subquadratic steps.arXiv preprint arXiv:2103.15352, 2021

work page arXiv 2021

[60] [60]

Private federated learning without a trusted server: Optimal algorithms for convex losses

Andrew Lowy and Meisam Razaviyayn. Private federated learning without a trusted server: Optimal algorithms for convex losses. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023

[61] [61]

Changyu Gao, Andrew Lowy, Xingyu Zhou, and Stephen J Wright. Private heterogeneous federated learning without a trusted server revisited: Error-optimal and communication-efficient algorithms for convex losses.arXiv preprint arXiv:2407.09690, 2024

work page arXiv 2024

[62] [62]

Differentially private empirical risk minimization with non-convex loss functions

Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. InInternational Conference on Machine Learning, pages 6526–6535. PMLR, 2019

work page 2019

[63] [63]

Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds

Yingxue Zhou, Xiangyi Chen, Mingyi Hong, Zhiwei Steven Wu, and Arindam Banerjee. Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds. arXiv preprint arXiv:2006.13501, 2020

work page arXiv 2006

[64] [64]

Faster rates of convergence to stationary points in differentially private optimization

Raman Arora, Raef Bassily, Tomás González, Cristóbal A Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. In International Conference on Machine Learning, pages 1060–1092. PMLR, 2023. 13

work page 2023

[65] [65]

Private (stochastic) non-convex optimization revisited: Second-order stationary points and excess risks.Advances in Neural Information Processing Systems, 36:65618–65641, 2023

Daogao Liu, Arun Ganesh, Sewoong Oh, and Abhradeep Guha Thakurta. Private (stochastic) non-convex optimization revisited: Second-order stationary points and excess risks.Advances in Neural Information Processing Systems, 36:65618–65641, 2023

work page 2023

[66] [66]

Adaptive batch size for privately finding second-order stationary points.arXiv preprint arXiv:2410.07502, 2024

Daogao Liu and Kunal Talwar. Adaptive batch size for privately finding second-order stationary points.arXiv preprint arXiv:2410.07502, 2024

work page arXiv 2024

[67] [67]

Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain

Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain.arXiv preprint arXiv:1803.02596, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[68] [68]

Old techniques in differentially private linear regression

Or Sheffet. Old techniques in differentially private linear regression. InAlgorithmic Learning Theory, pages 789–827. PMLR, 2019

work page 2019

[69] [69]

The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy.The Annals of Statistics, 49(5):2825–2850, 2021

T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy.The Annals of Statistics, 49(5):2825–2850, 2021

work page 2021

[70] [70]

Private gradient descent for linear regression: Tighter error bounds and instance-specific uncertainty estimation.arXiv preprint arXiv:2402.13531, 2024

Gavin Brown, Krishnamurthy Dvijotham, Georgina Evans, Daogao Liu, Adam Smith, and Abhradeep Thakurta. Private gradient descent for linear regression: Tighter error bounds and instance-specific uncertainty estimation.arXiv preprint arXiv:2402.13531, 2024

work page arXiv 2024

[71] [71]

(nearly) optimal private linear regression for sub-gaussian data via adaptive clipping

Prateek Varshney, Abhradeep Thakurta, and Prateek Jain. (nearly) optimal private linear regression for sub-gaussian data via adaptive clipping. InConference on Learning Theory, pages 1126–1166. PMLR, 2022

work page 2022

[72] [72]

Label robust and differentially private linear regression: Computational and statistical efficiency.Advances in Neural Information Processing Systems, 36:23019–23033, 2023

Xiyang Liu, Prateek Jain, Weihao Kong, Sewoong Oh, and Arun Suggala. Label robust and differentially private linear regression: Computational and statistical efficiency.Advances in Neural Information Processing Systems, 36:23019–23033, 2023

work page 2023

[73] [73]

Sample-optimal private regression in polynomial time.arXiv preprint arXiv:2503.24321, 2025

Prashanti Anderson, Ainesh Bakshi, Mahbod Majid, and Stefan Tiegel. Sample-optimal private regression in polynomial time.arXiv preprint arXiv:2503.24321, 2025

work page arXiv 2025

[74] [74]

Limitations

Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making.arXiv preprint arXiv:2112.13487, 2021. NeurIPS Paper Checklist 1.Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: : In the a...

work page arXiv 2021

[75] [75]

e∇mJ(θ) + λ |X | 1 |Y| −[π x(θ)]x∈X 2# =E

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page 2025

[76] [76]

who use PPO instead of NPG, and (iii) as privacy budget decreases (smaller ε), performance degrades as expected from theory. 36 AlgorithmεMean Final Reward Std Final Reward Best Epoch Mean PG N/A 334.37 25.25 361.70 DP-PG5 190.34 52.91 199.17 DP-PG3 143.87 22.88 187.17 NPG N/A 492.90 10.04 500.00 DP-NPG5 478.73 28.05 494.70 DP-NPG3 400.87 65.37 410.43 Tab...

work page