Randomized Least Squares Value Iteration itself is Joint Differentially Private

Haiyang Lu; Mohammad Sadegh Talebi; Pratik Gajane; Shaojie Bai

arxiv: 2606.01952 · v1 · pith:XZ3TVSNNnew · submitted 2026-06-01 · 💻 cs.LG

Randomized Least Squares Value Iteration itself is Joint Differentially Private

Haiyang Lu , Pratik Gajane , Shaojie Bai , Mohammad Sadegh Talebi This is my paper

Pith reviewed 2026-06-28 16:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learningdifferential privacyRLSVItabular MDPjoint differential privacyrandomized explorationprivacy-preserving RLepisodic MDP

0 comments

The pith

RLSVI is jointly differentially private in tabular episodic MDPs using only its existing exploration noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the Gaussian noise already added in Randomized Least Squares Value Iteration for exploration also satisfies joint differential privacy without any extra mechanisms. This holds for finite tabular episodic MDPs, with an explicit bound on the privacy parameter that depends on the number of states, actions, episodes, and horizon length. A reader would care because RL is moving into domains that require privacy protection for user data, and this result indicates that privacy can be obtained from the randomization that is already present for good performance. The analysis treats the noise schedule as the sole source of randomness and derives the joint DP guarantee directly from it.

Core claim

We prove that RLSVI is (ε(δ),δ)-joint differentially private in tabular MDP as is with ε(δ) = 2AK/(H² log(2HSA)) + 2√(2AK log(1/δ)/(H² log(2HSA))), where S and A are the number of states and actions respectively, H is the length of an episode and K is the number of episodes.

What carries the argument

The Gaussian noise added to least-squares value estimates in RLSVI, whose variance schedule for exploration is shown to simultaneously mask individual trajectories for joint DP.

If this is right

Standard RLSVI implementations already meet joint DP requirements in the tabular episodic setting without added privacy mechanisms.
The privacy parameter ε decreases as the horizon H grows relative to the number of episodes K.
Joint DP protects the full sequence of user interactions across episodes under the derived bound.
No trade-off between exploration and privacy needs to be introduced for this algorithm class.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar noise-based randomized RL algorithms may inherit joint DP if their perturbation distributions match the analyzed schedule.
The result suggests that variance tuning in RLSVI could be used to explicitly control the privacy level while preserving exploration.
Extending the analysis beyond tabular MDPs would require handling function approximation or continuous state spaces.

Load-bearing premise

The noise must follow the exact Gaussian distribution and variance schedule already used in RLSVI, and the MDP must be finite, tabular, and episodic with known horizon.

What would settle it

Replace the Gaussian noise in RLSVI with a different distribution or variance schedule and check whether the stated ε(δ) bound still holds for the same MDP parameters.

read the original abstract

As reinforcement learning (RL) increasingly applies to sensitive domains, such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users' sensitive information. We investigate privacy-preserving RL under an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we show a new privacy analysis that characterizes how the noise in RLSVI set for exploration simultaneously provides privacy protection. Specifically, we prove that RLSVI is $(\varepsilon(\delta),\delta)$-joint differentially private in tabular MDP as is with $\varepsilon(\delta) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/\delta)}{H^2\log(2HSA)}}$, where $S$ and $A$ are the number of states and actions respectively, $H$ is the length of an episode and $K$ is the number of episodes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RLSVI's existing exploration noise already yields an explicit joint-DP bound in tabular episodic MDPs, which removes the need for separate privacy noise under the stated assumptions.

read the letter

The core observation is that the Gaussian noise already present in RLSVI for exploration satisfies (ε(δ), δ)-joint differential privacy with the given closed-form bound, without any added mechanism. The paper derives this directly from the algorithm's variance schedule and the standard properties of the Gaussian mechanism applied to the least-squares updates in finite tabular episodic MDPs.

This is new relative to the cited prior work on private RL, which typically layers separate privacy noise on top of exploration. The analysis is self-contained and avoids circularity by treating the exploration noise parameters as fixed inputs. If the proof checks out, it is a practical simplification for anyone already running RLSVI in sensitive settings.

The bound itself is explicit and depends only on S, A, H, K and the noise variance, which is a strength. However, it applies only when the noise distribution and schedule match the original RLSVI exactly; any modification requires a fresh derivation. The setting is restricted to finite tabular MDPs with known horizon, so the result does not extend to function approximation or infinite state spaces. The abstract gives the final expression but no intermediate steps, so the full paper's error analysis and assumption list need verification.

This is for people working on private RL who use randomized exploration methods and want to know the privacy cost of the noise they are already adding. A reader focused on implementation in healthcare or recommendation systems would find the explicit ε(δ) useful.

The paper deserves peer review because the claim is a clean, falsifiable mathematical statement with potential practical impact, even if the constants or tightness could be improved later.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that Randomized Least Squares Value Iteration (RLSVI) is (ε(δ), δ)-joint differentially private in finite tabular episodic MDPs, where the privacy is provided solely by the Gaussian noise already used for randomized exploration. The explicit bound is ε(δ) = 2AK / (H² log(2HSA)) + 2 sqrt(2AK log(1/δ) / (H² log(2HSA))), with S, A, H, K denoting the number of states, actions, episode length, and number of episodes, respectively.

Significance. If the proof holds, the result is significant because it shows that the exploration noise in RLSVI can be repurposed to provide joint differential privacy without additional modifications or noise, which is advantageous for maintaining the algorithm's performance in privacy-sensitive domains. The paper earns credit for deriving a concrete, closed-form privacy bound directly from the algorithm's existing parameters and noise schedule, making the privacy guarantee 'free' in terms of extra design choices.

minor comments (2)

[Abstract] The bound is presented without specifying the base of the logarithm or the exact noise distribution (assumed Gaussian); this should be clarified for precision.
The manuscript would benefit from explicitly listing the assumptions (tabular finite MDP, episodic with known H, specific variance schedule) in a dedicated section or early in the introduction.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation of minor revision. We appreciate the acknowledgment that the result would be significant if the proof holds, as it demonstrates that the existing Gaussian noise in RLSVI can be repurposed to achieve joint differential privacy without additional modifications.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central result is a direct mathematical derivation showing that the Gaussian noise and variance schedule already present in RLSVI for exploration suffice to establish the stated (ε(δ), δ)-joint DP bound under the paper's assumptions of finite tabular episodic MDPs. The ε(δ) expression is obtained by applying standard privacy composition and concentration arguments to the algorithm's existing randomness parameters; it does not arise from fitting any quantity to data, redefining a target in terms of itself, or invoking a load-bearing self-citation whose validity depends on the present work. The derivation chain is therefore self-contained and independent of the target privacy claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the standard mathematical definition of joint differential privacy and the known noise properties of RLSVI; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

standard math Standard definition of (ε,δ)-joint differential privacy
The proof invokes the definition of joint DP to bound the privacy loss from the algorithm's randomness.

pith-pipeline@v0.9.1-grok · 5725 in / 1232 out tokens · 23282 ms · 2026-06-28T16:02:36.782337+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Agrawal, J

P. Agrawal, J. Chen, and N. Jiang. Improved worst-case regret bounds for randomized least- squares value iteration. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6566–6573, 2021

2021
[2]

Azize and D

A. Azize and D. Basu. When privacy meets partial information: A refined analysis of differen- tially private bandits.Advances in Neural Information Processing Systems, 35:32199–32210, 2022

2022
[3]

Azize and D

A. Azize and D. Basu. Rényi differentially private bandits. 2023

2023
[4]

S. Bai, M. S. Talebi, C. Zhao, P. Cheng, and J. Chen. Near-Optimal Reinforcement Learning With Shuffle Differential Privacy.IEEE Transactions on Network Science and Engineering, pages 1–20, 2026

2026
[5]

D. Basu, C. Dimitrakakis, and A. Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost?Computing Research Repository (CoRR), 1905, 2019

1905
[6]

Exploration by Random Network Distillation

Y . Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

Chapelle and L

O. Chapelle and L. Li. An empirical evaluation of thompson sampling.Advances in neural information processing systems, 24, 2011

2011
[8]

A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. Distributed differential privacy via shuffling. InAnnual international conference on the theory and applications of cryptographic techniques, pages 375–403. Springer, 2019

2019
[9]

S. R. Chowdhury and X. Zhou. Differentially private regret minimization in episodic markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6375–6383, 2022

2022
[10]

J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th annual symposium on foundations of computer science, pages 429–438. IEEE, 2013

2013
[11]

Dwork, F

C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. InTheory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer, 2006

2006
[12]

Dwork, A

C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy.Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

2014
[13]

Garcelon, V

E. Garcelon, V . Perchet, C. Pike-Burke, and M. Pirotta. Local differential privacy for regret minimization in reinforcement learning.Advances in Neural Information Processing Systems, 34:10561–10573, 2021

2021
[14]

J. Hsu, Z. Huang, A. Roth, T. Roughgarden, and Z. S. Wu. Private matchings and allocations. InProceedings of the forty-sixth annual ACM symposium on Theory of computing, STOC ’14, pages 21–30, New York, NY , USA, May 2014. Association for Computing Machinery

2014
[15]

Hu and N

B. Hu and N. Hegde. Near-optimal thompson sampling-based algorithms for differentially private stochastic bandits. InUncertainty in Artificial Intelligence, pages 844–852. PMLR, 2022

2022
[16]

P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. InConference on Learning Theory, pages 24–1. JMLR Workshop and Conference Proceedings, 2012

2012
[17]

C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is q-learning provably efficient?Advances in neural information processing systems, 31, 2018

2018
[18]

Kearns, M

M. Kearns, M. Pai, A. Roth, and J. Ullman. Mechanism design in large games: Incentives and privacy. InProceedings of the 5th conference on Innovations in theoretical computer science, pages 403–410, 2014. 10

2014
[19]

Y . Lei, D. Ye, S. Shen, Y . Sui, T. Zhu, and W. Zhou. New challenges in reinforcement learning: a survey of security and privacy.Artificial Intelligence Review, 56(7):7195–7236, July 2023

2023
[20]

Q. Liu, T. Yu, Y . Bai, and C. Jin. A sharp analysis of model-based reinforcement learning with self-play. InInternational Conference on Machine Learning, pages 7001–7010. PMLR, 2021

2021
[21]

I. Mironov. Rényi differential privacy. In2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017

2017
[22]

Mishra and A

N. Mishra and A. Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601, 2015

2015
[23]

Osband, J

I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning.Advances in neural information processing systems, 31, 2018

2018
[24]

Osband, B

I. Osband, B. V . Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019

2019
[25]

T. Ou, R. Cummings, and M. A. Medina. Thompson sampling itself is differentially private. InInternational Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2024

2024
[26]

Qiao and Y .-X

D. Qiao and Y .-X. Wang. Near-optimal differentially private reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 9914–9940. PMLR, 2023

2023
[27]

Qiao and Y .-X

D. Qiao and Y .-X. Wang. Differentially private reinforcement learning with self-play.Advances in Neural Information Processing Systems, 37:88117–88146, 2024

2024
[28]

D. Russo. Worst-case regret bounds for exploration via randomized value functions.Advances in neural information processing systems, 32, 2019

2019
[29]

Sajed and O

T. Sajed and O. Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. InInternational Conference on Machine Learning, pages 5579–5588. PMLR, 2019

2019
[30]

Shariff and O

R. Shariff and O. Sheffet. Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018

2018
[31]

Sutton and A

R. Sutton and A. Barto.Reinforcement Learning: An Introduction. MIT Press, 2018

2018
[32]

Tossou and C

A. Tossou and C. Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016

2016
[33]

Van Erven and P

T. Van Erven and P. Harremos. Rényi divergence and kullback-leibler divergence.IEEE Transactions on Information Theory, 60(7):3797–3820, 2014

2014
[34]

Vietri, B

G. Vietri, B. Balle, A. Krishnamurthy, and S. Wu. Private reinforcement learning with pac and regret guarantees. InInternational Conference on Machine Learning, pages 9754–9764. PMLR, 2020

2020
[35]

Xiong, R

Z. Xiong, R. Shen, Q. Cui, M. Fazel, and S. S. Du. Near-optimal randomized exploration for tabular markov decision processes.Advances in neural information processing systems, 35:6358–6371, 2022

2022
[36]

Zanette, D

A. Zanette, D. Brandfonbrener, E. Brunskill, M. Pirotta, and A. Lazaric. Frequentist regret bounds for randomized least-squares value iteration. InInternational Conference on Artificial Intelligence and Statistics, pages 1954–1964. PMLR, 2020

1954
[37]

Zhang, Z

K. Zhang, Z. Yang, and T. Ba¸ sar. Multi-agent reinforcement learning: A selective overview of theories and algorithms.Handbook of reinforcement learning and control, pages 321–384, 2021

2021
[38]

Zheng, T

K. Zheng, T. Cai, W. Huang, Z. Li, and L. Wang. Locally differentially private (contextual) bandits learning.Advances in Neural Information Processing Systems, 33:12300–12310, 2020. 11 A Technical Appendices and Supplementary Material Proof of Lemma 6. Dα(Gσ(D)∥G σ(D′)) =Dα(N(f(D), σ 2)∥ N(f(D ′), σ2)) = α(f(D)−f(D ′))2 2σ2 ≤ α∆2 2σ2 The second equality h...

2020

[1] [1]

Agrawal, J

P. Agrawal, J. Chen, and N. Jiang. Improved worst-case regret bounds for randomized least- squares value iteration. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6566–6573, 2021

2021

[2] [2]

Azize and D

A. Azize and D. Basu. When privacy meets partial information: A refined analysis of differen- tially private bandits.Advances in Neural Information Processing Systems, 35:32199–32210, 2022

2022

[3] [3]

Azize and D

A. Azize and D. Basu. Rényi differentially private bandits. 2023

2023

[4] [4]

S. Bai, M. S. Talebi, C. Zhao, P. Cheng, and J. Chen. Near-Optimal Reinforcement Learning With Shuffle Differential Privacy.IEEE Transactions on Network Science and Engineering, pages 1–20, 2026

2026

[5] [5]

D. Basu, C. Dimitrakakis, and A. Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost?Computing Research Repository (CoRR), 1905, 2019

1905

[6] [6]

Exploration by Random Network Distillation

Y . Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[7] [7]

Chapelle and L

O. Chapelle and L. Li. An empirical evaluation of thompson sampling.Advances in neural information processing systems, 24, 2011

2011

[8] [8]

A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. Distributed differential privacy via shuffling. InAnnual international conference on the theory and applications of cryptographic techniques, pages 375–403. Springer, 2019

2019

[9] [9]

S. R. Chowdhury and X. Zhou. Differentially private regret minimization in episodic markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6375–6383, 2022

2022

[10] [10]

J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th annual symposium on foundations of computer science, pages 429–438. IEEE, 2013

2013

[11] [11]

Dwork, F

C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. InTheory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer, 2006

2006

[12] [12]

Dwork, A

C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy.Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

2014

[13] [13]

Garcelon, V

E. Garcelon, V . Perchet, C. Pike-Burke, and M. Pirotta. Local differential privacy for regret minimization in reinforcement learning.Advances in Neural Information Processing Systems, 34:10561–10573, 2021

2021

[14] [14]

J. Hsu, Z. Huang, A. Roth, T. Roughgarden, and Z. S. Wu. Private matchings and allocations. InProceedings of the forty-sixth annual ACM symposium on Theory of computing, STOC ’14, pages 21–30, New York, NY , USA, May 2014. Association for Computing Machinery

2014

[15] [15]

Hu and N

B. Hu and N. Hegde. Near-optimal thompson sampling-based algorithms for differentially private stochastic bandits. InUncertainty in Artificial Intelligence, pages 844–852. PMLR, 2022

2022

[16] [16]

P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. InConference on Learning Theory, pages 24–1. JMLR Workshop and Conference Proceedings, 2012

2012

[17] [17]

C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is q-learning provably efficient?Advances in neural information processing systems, 31, 2018

2018

[18] [18]

Kearns, M

M. Kearns, M. Pai, A. Roth, and J. Ullman. Mechanism design in large games: Incentives and privacy. InProceedings of the 5th conference on Innovations in theoretical computer science, pages 403–410, 2014. 10

2014

[19] [19]

Y . Lei, D. Ye, S. Shen, Y . Sui, T. Zhu, and W. Zhou. New challenges in reinforcement learning: a survey of security and privacy.Artificial Intelligence Review, 56(7):7195–7236, July 2023

2023

[20] [20]

Q. Liu, T. Yu, Y . Bai, and C. Jin. A sharp analysis of model-based reinforcement learning with self-play. InInternational Conference on Machine Learning, pages 7001–7010. PMLR, 2021

2021

[21] [21]

I. Mironov. Rényi differential privacy. In2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017

2017

[22] [22]

Mishra and A

N. Mishra and A. Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601, 2015

2015

[23] [23]

Osband, J

I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning.Advances in neural information processing systems, 31, 2018

2018

[24] [24]

Osband, B

I. Osband, B. V . Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019

2019

[25] [25]

T. Ou, R. Cummings, and M. A. Medina. Thompson sampling itself is differentially private. InInternational Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2024

2024

[26] [26]

Qiao and Y .-X

D. Qiao and Y .-X. Wang. Near-optimal differentially private reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 9914–9940. PMLR, 2023

2023

[27] [27]

Qiao and Y .-X

D. Qiao and Y .-X. Wang. Differentially private reinforcement learning with self-play.Advances in Neural Information Processing Systems, 37:88117–88146, 2024

2024

[28] [28]

D. Russo. Worst-case regret bounds for exploration via randomized value functions.Advances in neural information processing systems, 32, 2019

2019

[29] [29]

Sajed and O

T. Sajed and O. Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. InInternational Conference on Machine Learning, pages 5579–5588. PMLR, 2019

2019

[30] [30]

Shariff and O

R. Shariff and O. Sheffet. Differentially private contextual linear bandits.Advances in Neural Information Processing Systems, 31, 2018

2018

[31] [31]

Sutton and A

R. Sutton and A. Barto.Reinforcement Learning: An Introduction. MIT Press, 2018

2018

[32] [32]

Tossou and C

A. Tossou and C. Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016

2016

[33] [33]

Van Erven and P

T. Van Erven and P. Harremos. Rényi divergence and kullback-leibler divergence.IEEE Transactions on Information Theory, 60(7):3797–3820, 2014

2014

[34] [34]

Vietri, B

G. Vietri, B. Balle, A. Krishnamurthy, and S. Wu. Private reinforcement learning with pac and regret guarantees. InInternational Conference on Machine Learning, pages 9754–9764. PMLR, 2020

2020

[35] [35]

Xiong, R

Z. Xiong, R. Shen, Q. Cui, M. Fazel, and S. S. Du. Near-optimal randomized exploration for tabular markov decision processes.Advances in neural information processing systems, 35:6358–6371, 2022

2022

[36] [36]

Zanette, D

A. Zanette, D. Brandfonbrener, E. Brunskill, M. Pirotta, and A. Lazaric. Frequentist regret bounds for randomized least-squares value iteration. InInternational Conference on Artificial Intelligence and Statistics, pages 1954–1964. PMLR, 2020

1954

[37] [37]

Zhang, Z

K. Zhang, Z. Yang, and T. Ba¸ sar. Multi-agent reinforcement learning: A selective overview of theories and algorithms.Handbook of reinforcement learning and control, pages 321–384, 2021

2021

[38] [38]

Zheng, T

K. Zheng, T. Cai, W. Huang, Z. Li, and L. Wang. Locally differentially private (contextual) bandits learning.Advances in Neural Information Processing Systems, 33:12300–12310, 2020. 11 A Technical Appendices and Supplementary Material Proof of Lemma 6. Dα(Gσ(D)∥G σ(D′)) =Dα(N(f(D), σ 2)∥ N(f(D ′), σ2)) = α(f(D)−f(D ′))2 2σ2 ≤ α∆2 2σ2 The second equality h...

2020