pith. sign in

arxiv: 2606.01952 · v1 · pith:XZ3TVSNNnew · submitted 2026-06-01 · 💻 cs.LG

Randomized Least Squares Value Iteration itself is Joint Differentially Private

Pith reviewed 2026-06-28 16:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learningdifferential privacyRLSVItabular MDPjoint differential privacyrandomized explorationprivacy-preserving RLepisodic MDP
0
0 comments X

The pith

RLSVI is jointly differentially private in tabular episodic MDPs using only its existing exploration noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the Gaussian noise already added in Randomized Least Squares Value Iteration for exploration also satisfies joint differential privacy without any extra mechanisms. This holds for finite tabular episodic MDPs, with an explicit bound on the privacy parameter that depends on the number of states, actions, episodes, and horizon length. A reader would care because RL is moving into domains that require privacy protection for user data, and this result indicates that privacy can be obtained from the randomization that is already present for good performance. The analysis treats the noise schedule as the sole source of randomness and derives the joint DP guarantee directly from it.

Core claim

We prove that RLSVI is (ε(δ),δ)-joint differentially private in tabular MDP as is with ε(δ) = 2AK/(H² log(2HSA)) + 2√(2AK log(1/δ)/(H² log(2HSA))), where S and A are the number of states and actions respectively, H is the length of an episode and K is the number of episodes.

What carries the argument

The Gaussian noise added to least-squares value estimates in RLSVI, whose variance schedule for exploration is shown to simultaneously mask individual trajectories for joint DP.

If this is right

  • Standard RLSVI implementations already meet joint DP requirements in the tabular episodic setting without added privacy mechanisms.
  • The privacy parameter ε decreases as the horizon H grows relative to the number of episodes K.
  • Joint DP protects the full sequence of user interactions across episodes under the derived bound.
  • No trade-off between exploration and privacy needs to be introduced for this algorithm class.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar noise-based randomized RL algorithms may inherit joint DP if their perturbation distributions match the analyzed schedule.
  • The result suggests that variance tuning in RLSVI could be used to explicitly control the privacy level while preserving exploration.
  • Extending the analysis beyond tabular MDPs would require handling function approximation or continuous state spaces.

Load-bearing premise

The noise must follow the exact Gaussian distribution and variance schedule already used in RLSVI, and the MDP must be finite, tabular, and episodic with known horizon.

What would settle it

Replace the Gaussian noise in RLSVI with a different distribution or variance schedule and check whether the stated ε(δ) bound still holds for the same MDP parameters.

read the original abstract

As reinforcement learning (RL) increasingly applies to sensitive domains, such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users' sensitive information. We investigate privacy-preserving RL under an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we show a new privacy analysis that characterizes how the noise in RLSVI set for exploration simultaneously provides privacy protection. Specifically, we prove that RLSVI is $(\varepsilon(\delta),\delta)$-joint differentially private in tabular MDP as is with $\varepsilon(\delta) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/\delta)}{H^2\log(2HSA)}}$, where $S$ and $A$ are the number of states and actions respectively, $H$ is the length of an episode and $K$ is the number of episodes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that Randomized Least Squares Value Iteration (RLSVI) is (ε(δ), δ)-joint differentially private in finite tabular episodic MDPs, where the privacy is provided solely by the Gaussian noise already used for randomized exploration. The explicit bound is ε(δ) = 2AK / (H² log(2HSA)) + 2 sqrt(2AK log(1/δ) / (H² log(2HSA))), with S, A, H, K denoting the number of states, actions, episode length, and number of episodes, respectively.

Significance. If the proof holds, the result is significant because it shows that the exploration noise in RLSVI can be repurposed to provide joint differential privacy without additional modifications or noise, which is advantageous for maintaining the algorithm's performance in privacy-sensitive domains. The paper earns credit for deriving a concrete, closed-form privacy bound directly from the algorithm's existing parameters and noise schedule, making the privacy guarantee 'free' in terms of extra design choices.

minor comments (2)
  1. [Abstract] The bound is presented without specifying the base of the logarithm or the exact noise distribution (assumed Gaussian); this should be clarified for precision.
  2. The manuscript would benefit from explicitly listing the assumptions (tabular finite MDP, episodic with known H, specific variance schedule) in a dedicated section or early in the introduction.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation of minor revision. We appreciate the acknowledgment that the result would be significant if the proof holds, as it demonstrates that the existing Gaussian noise in RLSVI can be repurposed to achieve joint differential privacy without additional modifications.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central result is a direct mathematical derivation showing that the Gaussian noise and variance schedule already present in RLSVI for exploration suffice to establish the stated (ε(δ), δ)-joint DP bound under the paper's assumptions of finite tabular episodic MDPs. The ε(δ) expression is obtained by applying standard privacy composition and concentration arguments to the algorithm's existing randomness parameters; it does not arise from fitting any quantity to data, redefining a target in terms of itself, or invoking a load-bearing self-citation whose validity depends on the present work. The derivation chain is therefore self-contained and independent of the target privacy claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the standard mathematical definition of joint differential privacy and the known noise properties of RLSVI; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • standard math Standard definition of (ε,δ)-joint differential privacy
    The proof invokes the definition of joint DP to bound the privacy loss from the algorithm's randomness.

pith-pipeline@v0.9.1-grok · 5725 in / 1232 out tokens · 23282 ms · 2026-06-28T16:02:36.782337+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 1 canonical work pages · 1 internal anchor

  1. [1]

    Agrawal, J

    P. Agrawal, J. Chen, and N. Jiang. Improved worst-case regret bounds for randomized least- squares value iteration. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6566–6573, 2021

  2. [2]

    Azize and D

    A. Azize and D. Basu. When privacy meets partial information: A refined analysis of differen- tially private bandits.Advances in Neural Information Processing Systems, 35:32199–32210, 2022

  3. [3]

    Azize and D

    A. Azize and D. Basu. Rényi differentially private bandits. 2023

  4. [4]

    S. Bai, M. S. Talebi, C. Zhao, P. Cheng, and J. Chen. Near-Optimal Reinforcement Learning With Shuffle Differential Privacy.IEEE Transactions on Network Science and Engineering, pages 1–20, 2026

  5. [5]

    D. Basu, C. Dimitrakakis, and A. Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost?Computing Research Repository (CoRR), 1905, 2019

  6. [6]

    Exploration by Random Network Distillation

    Y . Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018

  7. [7]

    Chapelle and L

    O. Chapelle and L. Li. An empirical evaluation of thompson sampling.Advances in neural information processing systems, 24, 2011

  8. [8]

    A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. Distributed differential privacy via shuffling. InAnnual international conference on the theory and applications of cryptographic techniques, pages 375–403. Springer, 2019

  9. [9]

    S. R. Chowdhury and X. Zhou. Differentially private regret minimization in episodic markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6375–6383, 2022

  10. [10]

    J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th annual symposium on foundations of computer science, pages 429–438. IEEE, 2013

  11. [11]

    Dwork, F

    C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. InTheory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer, 2006

  12. [12]

    Dwork, A

    C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy.Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

  13. [13]

    Garcelon, V

    E. Garcelon, V . Perchet, C. Pike-Burke, and M. Pirotta. Local differential privacy for regret minimization in reinforcement learning.Advances in Neural Information Processing Systems, 34:10561–10573, 2021

  14. [14]

    J. Hsu, Z. Huang, A. Roth, T. Roughgarden, and Z. S. Wu. Private matchings and allocations. InProceedings of the forty-sixth annual ACM symposium on Theory of computing, STOC ’14, pages 21–30, New York, NY , USA, May 2014. Association for Computing Machinery

  15. [15]

    Hu and N

    B. Hu and N. Hegde. Near-optimal thompson sampling-based algorithms for differentially private stochastic bandits. InUncertainty in Artificial Intelligence, pages 844–852. PMLR, 2022

  16. [16]

    P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. InConference on Learning Theory, pages 24–1. JMLR Workshop and Conference Proceedings, 2012

  17. [17]

    C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is q-learning provably efficient?Advances in neural information processing systems, 31, 2018

  18. [18]

    Kearns, M

    M. Kearns, M. Pai, A. Roth, and J. Ullman. Mechanism design in large games: Incentives and privacy. InProceedings of the 5th conference on Innovations in theoretical computer science, pages 403–410, 2014. 10

  19. [19]

    Y . Lei, D. Ye, S. Shen, Y . Sui, T. Zhu, and W. Zhou. New challenges in reinforcement learning: a survey of security and privacy.Artificial Intelligence Review, 56(7):7195–7236, July 2023

  20. [20]

    Q. Liu, T. Yu, Y . Bai, and C. Jin. A sharp analysis of model-based reinforcement learning with self-play. InInternational Conference on Machine Learning, pages 7001–7010. PMLR, 2021

  21. [21]

    I. Mironov. Rényi differential privacy. In2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017

  22. [22]

    Mishra and A

    N. Mishra and A. Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601, 2015

  23. [23]

    Osband, J

    I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning.Advances in neural information processing systems, 31, 2018

  24. [24]

    Osband, B

    I. Osband, B. V . Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019

  25. [25]

    T. Ou, R. Cummings, and M. A. Medina. Thompson sampling itself is differentially private. InInternational Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2024

  26. [26]

    Qiao and Y .-X

    D. Qiao and Y .-X. Wang. Near-optimal differentially private reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 9914–9940. PMLR, 2023

  27. [27]

    Qiao and Y .-X

    D. Qiao and Y .-X. Wang. Differentially private reinforcement learning with self-play.Advances in Neural Information Processing Systems, 37:88117–88146, 2024

  28. [28]

    D. Russo. Worst-case regret bounds for exploration via randomized value functions.Advances in neural information processing systems, 32, 2019

  29. [29]

    Sajed and O

    T. Sajed and O. Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. InInternational Conference on Machine Learning, pages 5579–5588. PMLR, 2019

  30. [30]

    Shariff and O

    R. Shariff and O. Sheffet. Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018

  31. [31]

    Sutton and A

    R. Sutton and A. Barto.Reinforcement Learning: An Introduction. MIT Press, 2018

  32. [32]

    Tossou and C

    A. Tossou and C. Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016

  33. [33]

    Van Erven and P

    T. Van Erven and P. Harremos. Rényi divergence and kullback-leibler divergence.IEEE Transactions on Information Theory, 60(7):3797–3820, 2014

  34. [34]

    Vietri, B

    G. Vietri, B. Balle, A. Krishnamurthy, and S. Wu. Private reinforcement learning with pac and regret guarantees. InInternational Conference on Machine Learning, pages 9754–9764. PMLR, 2020

  35. [35]

    Xiong, R

    Z. Xiong, R. Shen, Q. Cui, M. Fazel, and S. S. Du. Near-optimal randomized exploration for tabular markov decision processes.Advances in neural information processing systems, 35:6358–6371, 2022

  36. [36]

    Zanette, D

    A. Zanette, D. Brandfonbrener, E. Brunskill, M. Pirotta, and A. Lazaric. Frequentist regret bounds for randomized least-squares value iteration. InInternational Conference on Artificial Intelligence and Statistics, pages 1954–1964. PMLR, 2020

  37. [37]

    Zhang, Z

    K. Zhang, Z. Yang, and T. Ba¸ sar. Multi-agent reinforcement learning: A selective overview of theories and algorithms.Handbook of reinforcement learning and control, pages 321–384, 2021

  38. [38]

    Zheng, T

    K. Zheng, T. Cai, W. Huang, Z. Li, and L. Wang. Locally differentially private (contextual) bandits learning.Advances in Neural Information Processing Systems, 33:12300–12310, 2020. 11 A Technical Appendices and Supplementary Material Proof of Lemma 6. Dα(Gσ(D)∥G σ(D′)) =Dα(N(f(D), σ 2)∥ N(f(D ′), σ2)) = α(f(D)−f(D ′))2 2σ2 ≤ α∆2 2σ2 The second equality h...