Randomized Least Squares Value Iteration itself is Joint Differentially Private
Pith reviewed 2026-06-28 16:02 UTC · model grok-4.3
The pith
RLSVI is jointly differentially private in tabular episodic MDPs using only its existing exploration noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that RLSVI is (ε(δ),δ)-joint differentially private in tabular MDP as is with ε(δ) = 2AK/(H² log(2HSA)) + 2√(2AK log(1/δ)/(H² log(2HSA))), where S and A are the number of states and actions respectively, H is the length of an episode and K is the number of episodes.
What carries the argument
The Gaussian noise added to least-squares value estimates in RLSVI, whose variance schedule for exploration is shown to simultaneously mask individual trajectories for joint DP.
If this is right
- Standard RLSVI implementations already meet joint DP requirements in the tabular episodic setting without added privacy mechanisms.
- The privacy parameter ε decreases as the horizon H grows relative to the number of episodes K.
- Joint DP protects the full sequence of user interactions across episodes under the derived bound.
- No trade-off between exploration and privacy needs to be introduced for this algorithm class.
Where Pith is reading between the lines
- Similar noise-based randomized RL algorithms may inherit joint DP if their perturbation distributions match the analyzed schedule.
- The result suggests that variance tuning in RLSVI could be used to explicitly control the privacy level while preserving exploration.
- Extending the analysis beyond tabular MDPs would require handling function approximation or continuous state spaces.
Load-bearing premise
The noise must follow the exact Gaussian distribution and variance schedule already used in RLSVI, and the MDP must be finite, tabular, and episodic with known horizon.
What would settle it
Replace the Gaussian noise in RLSVI with a different distribution or variance schedule and check whether the stated ε(δ) bound still holds for the same MDP parameters.
read the original abstract
As reinforcement learning (RL) increasingly applies to sensitive domains, such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users' sensitive information. We investigate privacy-preserving RL under an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we show a new privacy analysis that characterizes how the noise in RLSVI set for exploration simultaneously provides privacy protection. Specifically, we prove that RLSVI is $(\varepsilon(\delta),\delta)$-joint differentially private in tabular MDP as is with $\varepsilon(\delta) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/\delta)}{H^2\log(2HSA)}}$, where $S$ and $A$ are the number of states and actions respectively, $H$ is the length of an episode and $K$ is the number of episodes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that Randomized Least Squares Value Iteration (RLSVI) is (ε(δ), δ)-joint differentially private in finite tabular episodic MDPs, where the privacy is provided solely by the Gaussian noise already used for randomized exploration. The explicit bound is ε(δ) = 2AK / (H² log(2HSA)) + 2 sqrt(2AK log(1/δ) / (H² log(2HSA))), with S, A, H, K denoting the number of states, actions, episode length, and number of episodes, respectively.
Significance. If the proof holds, the result is significant because it shows that the exploration noise in RLSVI can be repurposed to provide joint differential privacy without additional modifications or noise, which is advantageous for maintaining the algorithm's performance in privacy-sensitive domains. The paper earns credit for deriving a concrete, closed-form privacy bound directly from the algorithm's existing parameters and noise schedule, making the privacy guarantee 'free' in terms of extra design choices.
minor comments (2)
- [Abstract] The bound is presented without specifying the base of the logarithm or the exact noise distribution (assumed Gaussian); this should be clarified for precision.
- The manuscript would benefit from explicitly listing the assumptions (tabular finite MDP, episodic with known H, specific variance schedule) in a dedicated section or early in the introduction.
Simulated Author's Rebuttal
We thank the referee for the positive review and the recommendation of minor revision. We appreciate the acknowledgment that the result would be significant if the proof holds, as it demonstrates that the existing Gaussian noise in RLSVI can be repurposed to achieve joint differential privacy without additional modifications.
Circularity Check
No significant circularity detected
full rationale
The paper's central result is a direct mathematical derivation showing that the Gaussian noise and variance schedule already present in RLSVI for exploration suffice to establish the stated (ε(δ), δ)-joint DP bound under the paper's assumptions of finite tabular episodic MDPs. The ε(δ) expression is obtained by applying standard privacy composition and concentration arguments to the algorithm's existing randomness parameters; it does not arise from fitting any quantity to data, redefining a target in terms of itself, or invoking a load-bearing self-citation whose validity depends on the present work. The derivation chain is therefore self-contained and independent of the target privacy claim.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard definition of (ε,δ)-joint differential privacy
Reference graph
Works this paper leans on
-
[1]
Agrawal, J
P. Agrawal, J. Chen, and N. Jiang. Improved worst-case regret bounds for randomized least- squares value iteration. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6566–6573, 2021
2021
-
[2]
Azize and D
A. Azize and D. Basu. When privacy meets partial information: A refined analysis of differen- tially private bandits.Advances in Neural Information Processing Systems, 35:32199–32210, 2022
2022
-
[3]
Azize and D
A. Azize and D. Basu. Rényi differentially private bandits. 2023
2023
-
[4]
S. Bai, M. S. Talebi, C. Zhao, P. Cheng, and J. Chen. Near-Optimal Reinforcement Learning With Shuffle Differential Privacy.IEEE Transactions on Network Science and Engineering, pages 1–20, 2026
2026
-
[5]
D. Basu, C. Dimitrakakis, and A. Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost?Computing Research Repository (CoRR), 1905, 2019
1905
-
[6]
Exploration by Random Network Distillation
Y . Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[7]
Chapelle and L
O. Chapelle and L. Li. An empirical evaluation of thompson sampling.Advances in neural information processing systems, 24, 2011
2011
-
[8]
A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. Distributed differential privacy via shuffling. InAnnual international conference on the theory and applications of cryptographic techniques, pages 375–403. Springer, 2019
2019
-
[9]
S. R. Chowdhury and X. Zhou. Differentially private regret minimization in episodic markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6375–6383, 2022
2022
-
[10]
J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th annual symposium on foundations of computer science, pages 429–438. IEEE, 2013
2013
-
[11]
Dwork, F
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. InTheory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer, 2006
2006
-
[12]
Dwork, A
C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy.Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014
2014
-
[13]
Garcelon, V
E. Garcelon, V . Perchet, C. Pike-Burke, and M. Pirotta. Local differential privacy for regret minimization in reinforcement learning.Advances in Neural Information Processing Systems, 34:10561–10573, 2021
2021
-
[14]
J. Hsu, Z. Huang, A. Roth, T. Roughgarden, and Z. S. Wu. Private matchings and allocations. InProceedings of the forty-sixth annual ACM symposium on Theory of computing, STOC ’14, pages 21–30, New York, NY , USA, May 2014. Association for Computing Machinery
2014
-
[15]
Hu and N
B. Hu and N. Hegde. Near-optimal thompson sampling-based algorithms for differentially private stochastic bandits. InUncertainty in Artificial Intelligence, pages 844–852. PMLR, 2022
2022
-
[16]
P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. InConference on Learning Theory, pages 24–1. JMLR Workshop and Conference Proceedings, 2012
2012
-
[17]
C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is q-learning provably efficient?Advances in neural information processing systems, 31, 2018
2018
-
[18]
Kearns, M
M. Kearns, M. Pai, A. Roth, and J. Ullman. Mechanism design in large games: Incentives and privacy. InProceedings of the 5th conference on Innovations in theoretical computer science, pages 403–410, 2014. 10
2014
-
[19]
Y . Lei, D. Ye, S. Shen, Y . Sui, T. Zhu, and W. Zhou. New challenges in reinforcement learning: a survey of security and privacy.Artificial Intelligence Review, 56(7):7195–7236, July 2023
2023
-
[20]
Q. Liu, T. Yu, Y . Bai, and C. Jin. A sharp analysis of model-based reinforcement learning with self-play. InInternational Conference on Machine Learning, pages 7001–7010. PMLR, 2021
2021
-
[21]
I. Mironov. Rényi differential privacy. In2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017
2017
-
[22]
Mishra and A
N. Mishra and A. Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601, 2015
2015
-
[23]
Osband, J
I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning.Advances in neural information processing systems, 31, 2018
2018
-
[24]
Osband, B
I. Osband, B. V . Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019
2019
-
[25]
T. Ou, R. Cummings, and M. A. Medina. Thompson sampling itself is differentially private. InInternational Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2024
2024
-
[26]
Qiao and Y .-X
D. Qiao and Y .-X. Wang. Near-optimal differentially private reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 9914–9940. PMLR, 2023
2023
-
[27]
Qiao and Y .-X
D. Qiao and Y .-X. Wang. Differentially private reinforcement learning with self-play.Advances in Neural Information Processing Systems, 37:88117–88146, 2024
2024
-
[28]
D. Russo. Worst-case regret bounds for exploration via randomized value functions.Advances in neural information processing systems, 32, 2019
2019
-
[29]
Sajed and O
T. Sajed and O. Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. InInternational Conference on Machine Learning, pages 5579–5588. PMLR, 2019
2019
-
[30]
Shariff and O
R. Shariff and O. Sheffet. Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018
2018
-
[31]
Sutton and A
R. Sutton and A. Barto.Reinforcement Learning: An Introduction. MIT Press, 2018
2018
-
[32]
Tossou and C
A. Tossou and C. Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016
2016
-
[33]
Van Erven and P
T. Van Erven and P. Harremos. Rényi divergence and kullback-leibler divergence.IEEE Transactions on Information Theory, 60(7):3797–3820, 2014
2014
-
[34]
Vietri, B
G. Vietri, B. Balle, A. Krishnamurthy, and S. Wu. Private reinforcement learning with pac and regret guarantees. InInternational Conference on Machine Learning, pages 9754–9764. PMLR, 2020
2020
-
[35]
Xiong, R
Z. Xiong, R. Shen, Q. Cui, M. Fazel, and S. S. Du. Near-optimal randomized exploration for tabular markov decision processes.Advances in neural information processing systems, 35:6358–6371, 2022
2022
-
[36]
Zanette, D
A. Zanette, D. Brandfonbrener, E. Brunskill, M. Pirotta, and A. Lazaric. Frequentist regret bounds for randomized least-squares value iteration. InInternational Conference on Artificial Intelligence and Statistics, pages 1954–1964. PMLR, 2020
1954
-
[37]
Zhang, Z
K. Zhang, Z. Yang, and T. Ba¸ sar. Multi-agent reinforcement learning: A selective overview of theories and algorithms.Handbook of reinforcement learning and control, pages 321–384, 2021
2021
-
[38]
Zheng, T
K. Zheng, T. Cai, W. Huang, Z. Li, and L. Wang. Locally differentially private (contextual) bandits learning.Advances in Neural Information Processing Systems, 33:12300–12310, 2020. 11 A Technical Appendices and Supplementary Material Proof of Lemma 6. Dα(Gσ(D)∥G σ(D′)) =Dα(N(f(D), σ 2)∥ N(f(D ′), σ2)) = α(f(D)−f(D ′))2 2σ2 ≤ α∆2 2σ2 The second equality h...
2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.