pith. sign in

arxiv: 2510.21060 · v3 · pith:LUJ2KHPMnew · submitted 2025-10-24 · 💻 cs.LG · cs.AI

On the Sample Complexity of Differentially Private Policy Optimization

Pith reviewed 2026-05-18 04:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords differential privacypolicy optimizationreinforcement learningsample complexitypolicy gradientnatural policy gradient
0
0 comments X

The pith

Differentially private policy optimization adds privacy costs only as lower-order terms in sample complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a definition of differential privacy suited to the on-policy nature of policy optimization in reinforcement learning. Using this definition, it provides sample complexity analyses for standard algorithms including policy gradient and natural policy gradient under a unified framework. The results show that enforcing privacy often increases sample needs only by lower-order terms. Readers care because these algorithms run in sensitive domains like healthcare and robotics where data privacy must be balanced against performance.

Core claim

The authors formalize a differential privacy definition for policy optimization that addresses on-policy learning dynamics and the choice of privacy unit. They analyze the sample complexity of policy gradient, natural policy gradient, and related algorithms under differential privacy constraints via a unified framework. The central result is that privacy costs typically manifest as lower-order terms in the sample complexity, accompanied by several subtle observations particular to private policy optimization settings.

What carries the argument

A unified analytical framework for deriving sample complexity bounds of differentially private policy optimization algorithms.

Load-bearing premise

The analysis depends on the newly formalized definition of differential privacy for on-policy learning dynamics and the specific choice of privacy unit.

What would settle it

An experiment or calculation showing that the additional samples required for a fixed privacy level grow linearly with the number of iterations or episodes, rather than as a lower-order additive or multiplicative term, would falsify the main claim.

read the original abstract

Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper initiates a theoretical study of differentially private policy optimization in reinforcement learning. It formalizes a DP definition tailored to on-policy learning dynamics and the choice of privacy unit, then analyzes the sample complexity of policy gradient (PG), natural policy gradient (NPG), and variants under DP constraints via a unified framework. The central claim is that privacy costs often manifest as lower-order terms in the sample complexity, with additional subtle observations in private PO settings.

Significance. If the results hold, the work provides practical insights for privacy-preserving PO in sensitive domains like healthcare and robotics by suggesting that DP can often be added with only lower-order sample overhead. The unified analysis across PG/NPG and the focus on on-policy dynamics are strengths; the paper merits credit for attempting a systematic treatment of privacy unit choice in trajectory-based learning.

major comments (2)
  1. [§3 (DP definition for PO)] §3 (DP definition for PO): The custom DP notion, which redefines the privacy unit to account for on-policy trajectory dependence, is load-bearing for the entire sample-complexity analysis. The manuscript provides no formal reduction to standard (ε,δ)-DP, no composition theorem over iterative policy updates, and no verification that the neighboring relation prevents leakage during gradient estimation or policy improvement steps. Without this, the separation of privacy costs as lower-order terms does not necessarily follow.
  2. [§4 (unified sample-complexity bounds)] §4 (unified sample-complexity bounds): The claim that privacy overhead is lower-order for both PG and NPG relies on the fixed custom DP definition and specific assumptions on how privacy noise interacts with on-policy sampling. The presentation does not explicitly derive or bound the interaction term between the privacy mechanism and the policy-update operator, leaving open whether the lower-order result holds under standard MDP assumptions or requires additional restrictions on the policy class or horizon.
minor comments (2)
  1. [Abstract] The abstract refers to 'and more' algorithms without listing the variants analyzed; this should be made precise in the introduction or contributions paragraph.
  2. [Notation and Preliminaries] Notation for privacy parameters (ε, δ) and sample terms should be introduced once and used consistently; occasional redefinition across sections reduces readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help strengthen the rigor of our analysis. We appreciate the positive assessment of the paper's contributions in formalizing DP for policy optimization and analyzing sample complexity. We address each major comment below and will incorporate the suggested clarifications and additions in the revised manuscript.

read point-by-point responses
  1. Referee: §3 (DP definition for PO): The custom DP notion, which redefines the privacy unit to account for on-policy trajectory dependence, is load-bearing for the entire sample-complexity analysis. The manuscript provides no formal reduction to standard (ε,δ)-DP, no composition theorem over iterative policy updates, and no verification that the neighboring relation prevents leakage during gradient estimation or policy improvement steps. Without this, the separation of privacy costs as lower-order terms does not necessarily follow.

    Authors: We agree that establishing a formal link to standard DP strengthens the foundation. In the revision, we will add an appendix providing a formal reduction from our tailored DP definition to standard (ε,δ)-DP with respect to neighboring trajectories. We will also derive a composition theorem for the sequence of policy updates and explicitly verify that the chosen neighboring relation bounds sensitivity to prevent leakage in both gradient estimation and policy improvement steps. These additions will confirm that the lower-order privacy cost separation holds under the defined notion. revision: yes

  2. Referee: §4 (unified sample-complexity bounds): The claim that privacy overhead is lower-order for both PG and NPG relies on the fixed custom DP definition and specific assumptions on how privacy noise interacts with on-policy sampling. The presentation does not explicitly derive or bound the interaction term between the privacy mechanism and the policy-update operator, leaving open whether the lower-order result holds under standard MDP assumptions or requires additional restrictions on the policy class or horizon.

    Authors: We acknowledge the value of making the interaction term explicit. In the revised version, we will expand the unified analysis in Section 4 to derive and bound the interaction between the privacy noise and the policy-update operator. We will show that the lower-order overhead holds under standard MDP assumptions (bounded rewards, finite horizon, Lipschitz policies) without further restrictions on the policy class, and we will clarify any horizon-dependent conditions in the statements of the theorems. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation self-contained from new DP definition and standard primitives

full rationale

The paper introduces a custom DP definition for on-policy PO and derives sample-complexity bounds for PG/NPG variants under that definition. No quoted step reduces a claimed result to a fitted parameter, self-citation chain, or input by construction. The central claim that privacy appears as a lower-order term follows directly from the analysis once the neighboring relation and privacy unit are fixed; this is a standard first-principles derivation rather than a renaming or self-referential fit. The work is self-contained against external DP and RL benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view prevents full enumeration; the central claims rest on a custom DP definition for on-policy settings and standard RL convergence assumptions that are not detailed here.

axioms (1)
  • domain assumption A suitable differential privacy definition exists for on-policy policy optimization that respects learning dynamics and privacy unit choice.
    Stated in the abstract as the first step before sample-complexity analysis.

pith-pipeline@v0.9.0 · 5684 in / 1127 out tokens · 31162 ms · 2026-05-18T04:40:49.628759+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 3 internal anchors

  1. [1]

    Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8:229–256, 1992

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8:229–256, 1992

  2. [2]

    Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

  3. [3]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  4. [4]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  5. [5]

    On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

    Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

  6. [6]

    A general sample complexity analysis of vanilla policy gradient

    Rui Yuan, Robert M Gower, and Alessandro Lazaric. A general sample complexity analysis of vanilla policy gradient. InInternational Conference on Artificial Intelligence and Statistics, pages 3332–3380. PMLR, 2022

  7. [7]

    Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps

    Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. InProceedings of the AAAI Confer- ence on Artificial Intelligence, volume 34, pages 5668–5675, 2020

  8. [8]

    Neural trust region/proximal policy optimization attains globally optimal policy.Advances in neural information processing systems, 32, 2019

    Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural trust region/proximal policy optimization attains globally optimal policy.Advances in neural information processing systems, 32, 2019

  9. [9]

    Calibrating noise to sensitivity in private data analysis

    Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. InTheory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer, 2006

  10. [10]

    A natural policy gradient.Advances in neural information processing systems, 14, 2001

    Sham M Kakade. A natural policy gradient.Advances in neural information processing systems, 14, 2001

  11. [11]

    Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J

    Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, and Wen Sun. Rebel: Reinforcement learning via regressing relative rewards, 2024. URL https://arxiv.org/abs/2404.16767

  12. [12]

    Differentially private policy gradient.arXiv preprint arXiv:2501.19080, 2025

    Alexandre Rio, Merwan Barlier, and Igor Colin. Differentially private policy gradient.arXiv preprint arXiv:2501.19080, 2025

  13. [13]

    Differentially private regret minimization in episodic markov decision processes

    Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6375–6383, 2022. 10

  14. [14]

    Differentially private reinforcement learning with linear function approximation

    Xingyu Zhou. Differentially private reinforcement learning with linear function approximation. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(1):1–27, 2022

  15. [15]

    Differentially private policy evaluation

    Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. In International Conference on Machine Learning, pages 2130–2138. PMLR, 2016

  16. [16]

    Privately aligning language models with reinforcement learning.arXiv preprint arXiv:2310.16960, 2023

    Fan Wu, Huseyin A Inan, Arturs Backurs, Varun Chandrasekaran, Janardhan Kulkarni, and Robert Sim. Privately aligning language models with reinforcement learning.arXiv preprint arXiv:2310.16960, 2023

  17. [17]

    Regressing the relative future: Efficient policy optimization for multi-turn rlhf

    Zhaolin Gao, Wenhao Zhan, Jonathan D Chang, Gokul Swamy, Kianté Brantley, Jason D Lee, and Wen Sun. Regressing the relative future: Efficient policy optimization for multi-turn rlhf. arXiv preprint arXiv:2410.04612, 2024

  18. [18]

    Spoiled for choice? personalized recommendation for healthcare decisions: A multiarmed bandit approach.Information Systems Research, 34(4):1493–1512, 2023

    Tongxin Zhou, Yingfei Wang, Lu Yan, and Yong Tan. Spoiled for choice? personalized recommendation for healthcare decisions: A multiarmed bandit approach.Information Systems Research, 34(4):1493–1512, 2023

  19. [19]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  20. [20]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  21. [21]

    Differentially private empirical risk minimization.Journal of Machine Learning Research, 12(3), 2011

    Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization.Journal of Machine Learning Research, 12(3), 2011

  22. [22]

    Private empirical risk minimization: Efficient algorithms and tight error bounds

    Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In2014 IEEE 55th annual symposium on foundations of computer science, pages 464–473. IEEE, 2014

  23. [23]

    Private stochastic convex optimization with optimal rates.Advances in neural information processing systems, 32, 2019

    Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates.Advances in neural information processing systems, 32, 2019

  24. [24]

    Private reinforcement learning with pac and regret guarantees

    Giuseppe Vietri, Borja Balle, Akshay Krishnamurthy, and Steven Wu. Private reinforcement learning with pac and regret guarantees. InInternational Conference on Machine Learning, pages 9754–9764. PMLR, 2020

  25. [25]

    Private matchings and allocations.SIAM Journal on Computing, 45:1953–1984, 2016

    J Hsu, Z Huang, A Roth, T Roughgarden, and ZS Wu. Private matchings and allocations.SIAM Journal on Computing, 45:1953–1984, 2016

  26. [26]

    Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018

    Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018

  27. [27]

    Near-optimal differentially private reinforcement learning

    Dan Qiao and Yu-Xiang Wang. Near-optimal differentially private reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 9914–9940. PMLR, 2023

  28. [28]

    The algorithmic foundations of differential privacy.Founda- tions and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

    Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy.Founda- tions and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

  29. [29]

    On the global optimum convergence of momentum- based policy gradient

    Yuhao Ding, Junzi Zhang, and Javad Lavaei. On the global optimum convergence of momentum- based policy gradient. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 ofProceedings of Machine Learning Research, pages 1910–1934. PMLR, 2...

  30. [30]

    An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods, 2022

    Yanli Liu, Kaiqing Zhang, Tamer Ba¸ sar, and Wotao Yin. An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods, 2022. URL https://arxiv. org/abs/2211.07937. 11

  31. [31]

    Neural policy gradient methods: Global optimality and rates of convergence, 2019

    Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence, 2019. URL https://arxiv.org/abs/1909. 01150

  32. [32]

    Hybrid rl: Using both offline and online data can make rl efficient,

    Yuda Song, Yifei Zhou, Ayush Sekhari, J Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid rl: Using both offline and online data can make rl efficient.arXiv preprint arXiv:2210.06718, 2022

  33. [33]

    Mechanism design via differential privacy

    Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007

  34. [34]

    Insufficient statistics perturbation: Stable estimators for private least squares.arXiv preprint arXiv:2404.15409, 2024

    Gavin Brown, Jonathan Hayase, Samuel Hopkins, Weihao Kong, Xiyang Liu, Sewoong Oh, Juan C Perdomo, and Adam Smith. Insufficient statistics perturbation: Stable estimators for private least squares.arXiv preprint arXiv:2404.15409, 2024

  35. [35]

    Near-optimal private learning in linear contextual bandits, 2025

    Fan Chen, Jiachun Li, Alexander Rakhlin, and David Simchi-Levi. Near-optimal private learning in linear contextual bandits, 2025. URLhttps://arxiv.org/abs/2502.13115

  36. [36]

    Linear convergence of natural policy gradient methods with log-linear policies.arXiv preprint arXiv:2210.01400, 2022

    Rui Yuan, Simon S Du, Robert M Gower, Alessandro Lazaric, and Lin Xiao. Linear convergence of natural policy gradient methods with log-linear policies.arXiv preprint arXiv:2210.01400, 2022

  37. [37]

    Stochastic variance-reduced policy gradient

    Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. InInternational conference on machine learning, pages 4026–4035. PMLR, 2018

  38. [38]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

  39. [39]

    Global optimality guarantees for policy gradient methods

    Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. Operations Research, 72(5):1906–1927, 2024

  40. [40]

    Neural proximal/trust region policy optimization attains globally optimal policy (2019).arXiv preprint arXiv:1906.10306, 1906

    Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy (2019).arXiv preprint arXiv:1906.10306, 1906

  41. [41]

    An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

    Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

  42. [42]

    Optimistic policy optimization with bandit feedback

    Lior Shani, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback. InInternational Conference on Machine Learning, pages 8604–8613. PMLR, 2020

  43. [43]

    Provably efficient exploration in policy optimization

    Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. InInternational Conference on Machine Learning, pages 1283–1294. PMLR, 2020

  44. [44]

    Pc-pg: Policy cover directed exploration for provable policy gradient learning.Advances in neural information processing systems, 33:13399–13412, 2020

    Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning.Advances in neural information processing systems, 33:13399–13412, 2020

  45. [45]

    Cautiously optimistic policy optimiza- tion and exploration with linear function approximation

    Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimiza- tion and exploration with linear function approximation. InConference on Learning Theory, pages 4473–4525. PMLR, 2021

  46. [46]

    Optimistic natural policy gradient: a simple efficient policy optimization framework for online rl.Advances in Neural Information Processing Systems, 36:3560–3577, 2023

    Qinghua Liu, Gellért Weisz, András György, Chi Jin, and Csaba Szepesvári. Optimistic natural policy gradient: a simple efficient policy optimization framework for online rl.Advances in Neural Information Processing Systems, 36:3560–3577, 2023

  47. [47]

    (nearly) optimal differentially private stochastic multi-arm bandits

    Nikita Mishra and Abhradeep Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601, 2015. 12

  48. [48]

    An optimal private stochastic-mab algorithm based on optimal private stopping rule

    Touqir Sajed and Or Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. InInternational Conference on Machine Learning, pages 5579–5588. PMLR, 2019

  49. [49]

    Distributed differential privacy in multi-armed bandits.arXiv preprint arXiv:2206.05772, 2022

    Sayak Ray Chowdhury and Xingyu Zhou. Distributed differential privacy in multi-armed bandits.arXiv preprint arXiv:2206.05772, 2022

  50. [50]

    Multi-armed bandits with local differential privacy.arXiv preprint arXiv:2007.03121, 2020

    Wenbo Ren, Xingyu Zhou, Jia Liu, and Ness B Shroff. Multi-armed bandits with local differential privacy.arXiv preprint arXiv:2007.03121, 2020

  51. [51]

    On private and robust bandits.Advances in Neural Information Processing Systems, 36:34778–34790, 2023

    Yulian Wu, Xingyu Zhou, Youming Tao, and Di Wang. On private and robust bandits.Advances in Neural Information Processing Systems, 36:34778–34790, 2023

  52. [52]

    A reduction from linear contextual bandits lower bounds to estimations lower bounds

    Jiahao He, Jiheng Zhang, and Rachel Q Zhang. A reduction from linear contextual bandits lower bounds to estimations lower bounds. InInternational Conference on Machine Learning, pages 8660–8677. PMLR, 2022

  53. [53]

    Shuffle private linear contextual bandits.arXiv preprint arXiv:2202.05567, 2022

    Sayak Ray Chowdhury and Xingyu Zhou. Shuffle private linear contextual bandits.arXiv preprint arXiv:2202.05567, 2022

  54. [54]

    On differentially private federated linear contextual bandits.arXiv preprint arXiv:2302.13945, 2023

    Xingyu Zhou and Sayak Ray Chowdhury. On differentially private federated linear contextual bandits.arXiv preprint arXiv:2302.13945, 2023

  55. [55]

    Differentially private ex- ploration in reinforcement learning with linear representation.arXiv preprint arXiv:2112.01585, 2021

    Paul Luyo, Evrard Garcelon, Alessandro Lazaric, and Matteo Pirotta. Differentially private ex- ploration in reinforcement learning with linear representation.arXiv preprint arXiv:2112.01585, 2021

  56. [56]

    Private stochastic convex optimization: optimal rates in linear time

    Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. InProceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020

  57. [57]

    Private stochastic convex optimiza- tion: Optimal rates in l1 geometry

    Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimiza- tion: Optimal rates in l1 geometry. InInternational Conference on Machine Learning, pages 393–403. PMLR, 2021

  58. [58]

    Non-euclidean differentially private stochastic convex optimization

    Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. InConference on Learning Theory, pages 474–499. PMLR, 2021

  59. [59]

    Private non-smooth empirical risk minimiza- tion and stochastic convex optimization in subquadratic steps.arXiv preprint arXiv:2103.15352, 2021

    Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimiza- tion and stochastic convex optimization in subquadratic steps.arXiv preprint arXiv:2103.15352, 2021

  60. [60]

    Private federated learning without a trusted server: Optimal algorithms for convex losses

    Andrew Lowy and Meisam Razaviyayn. Private federated learning without a trusted server: Optimal algorithms for convex losses. InThe Eleventh International Conference on Learning Representations, 2023

  61. [61]

    Changyu Gao, Andrew Lowy, Xingyu Zhou, and Stephen J Wright. Private heterogeneous federated learning without a trusted server revisited: Error-optimal and communication-efficient algorithms for convex losses.arXiv preprint arXiv:2407.09690, 2024

  62. [62]

    Differentially private empirical risk minimization with non-convex loss functions

    Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. InInternational Conference on Machine Learning, pages 6526–6535. PMLR, 2019

  63. [63]

    Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds

    Yingxue Zhou, Xiangyi Chen, Mingyi Hong, Zhiwei Steven Wu, and Arindam Banerjee. Private stochastic non-convex optimization: Adaptive algorithms and tighter generalization bounds. arXiv preprint arXiv:2006.13501, 2020

  64. [64]

    Faster rates of convergence to stationary points in differentially private optimization

    Raman Arora, Raef Bassily, Tomás González, Cristóbal A Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. In International Conference on Machine Learning, pages 1060–1092. PMLR, 2023. 13

  65. [65]

    Private (stochastic) non-convex optimization revisited: Second-order stationary points and excess risks.Advances in Neural Information Processing Systems, 36:65618–65641, 2023

    Daogao Liu, Arun Ganesh, Sewoong Oh, and Abhradeep Guha Thakurta. Private (stochastic) non-convex optimization revisited: Second-order stationary points and excess risks.Advances in Neural Information Processing Systems, 36:65618–65641, 2023

  66. [66]

    Adaptive batch size for privately finding second-order stationary points.arXiv preprint arXiv:2410.07502, 2024

    Daogao Liu and Kunal Talwar. Adaptive batch size for privately finding second-order stationary points.arXiv preprint arXiv:2410.07502, 2024

  67. [67]

    Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain

    Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain.arXiv preprint arXiv:1803.02596, 2018

  68. [68]

    Old techniques in differentially private linear regression

    Or Sheffet. Old techniques in differentially private linear regression. InAlgorithmic Learning Theory, pages 789–827. PMLR, 2019

  69. [69]

    The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy.The Annals of Statistics, 49(5):2825–2850, 2021

    T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy.The Annals of Statistics, 49(5):2825–2850, 2021

  70. [70]

    Private gradient descent for linear regression: Tighter error bounds and instance-specific uncertainty estimation.arXiv preprint arXiv:2402.13531, 2024

    Gavin Brown, Krishnamurthy Dvijotham, Georgina Evans, Daogao Liu, Adam Smith, and Abhradeep Thakurta. Private gradient descent for linear regression: Tighter error bounds and instance-specific uncertainty estimation.arXiv preprint arXiv:2402.13531, 2024

  71. [71]

    (nearly) optimal private linear regression for sub-gaussian data via adaptive clipping

    Prateek Varshney, Abhradeep Thakurta, and Prateek Jain. (nearly) optimal private linear regression for sub-gaussian data via adaptive clipping. InConference on Learning Theory, pages 1126–1166. PMLR, 2022

  72. [72]

    Label robust and differentially private linear regression: Computational and statistical efficiency.Advances in Neural Information Processing Systems, 36:23019–23033, 2023

    Xiyang Liu, Prateek Jain, Weihao Kong, Sewoong Oh, and Arun Suggala. Label robust and differentially private linear regression: Computational and statistical efficiency.Advances in Neural Information Processing Systems, 36:23019–23033, 2023

  73. [73]

    Sample-optimal private regression in polynomial time.arXiv preprint arXiv:2503.24321, 2025

    Prashanti Anderson, Ainesh Bakshi, Mahbod Majid, and Stefan Tiegel. Sample-optimal private regression in polynomial time.arXiv preprint arXiv:2503.24321, 2025

  74. [74]

    Limitations

    Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making.arXiv preprint arXiv:2112.13487, 2021. NeurIPS Paper Checklist 1.Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: : In the a...

  75. [75]

    e∇mJ(θ) + λ |X | 1 |Y| −[π x(θ)]x∈X 2# =E

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

  76. [76]

    who use PPO instead of NPG, and (iii) as privacy budget decreases (smaller ε), performance degrades as expected from theory. 36 AlgorithmεMean Final Reward Std Final Reward Best Epoch Mean PG N/A 334.37 25.25 361.70 DP-PG5 190.34 52.91 199.17 DP-PG3 143.87 22.88 187.17 NPG N/A 492.90 10.04 500.00 DP-NPG5 478.73 28.05 494.70 DP-NPG3 400.87 65.37 410.43 Tab...