arxiv: 2605.07049 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Towards Differentially Private Reinforcement Learning with General Function Approximation

Yi He , Xingyu Zhou

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:13 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords differentially private reinforcement learninggeneral function approximationregret boundsexponential mechanismcoverability coefficientbatched updatesmodel-free RLonline reinforcement learning

0 comments

The pith

Differentially private RL achieves O(K^{3/5}) regret under general function approximation, matching the linear case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the first regret analysis for differentially private online reinforcement learning when using general function approximation instead of tabular or linear restrictions. It combines batched policy updates with the exponential mechanism and shows that the resulting regret scales as tilde-O of K to the 3/5 power. A sympathetic reader would care because this indicates privacy constraints need not degrade performance when moving to flexible function classes used in real applications. The work also supplies the first coverability-based regret bound for batched non-private RL and identifies inconsistencies in earlier linear analyses.

Core claim

The central claim is that, even under general function approximation, the regret in model-free differentially private online RL is tilde-O(K^{3/5}), where K is the number of episodes. This is obtained by a batched update scheme that invokes the exponential mechanism at batch boundaries, with the privacy noise controlled through the coverability coefficient of the function class. The same analysis yields the first regret bound for non-private batched RL that depends on coverability rather than Eluder dimension.

What carries the argument

Batched policy updates paired with the exponential mechanism, analyzed using the coverability coefficient of the function class to bound the additional regret from privacy noise.

If this is right

Regret scales as tilde-O(K^{3/5}) in the private model-free setting with general function approximation.
The bound matches the best known rate for the linear case.
Non-private batched RL obtains regret bounds expressed in terms of coverability.
Prior results on private linear RL contain gaps that the new analysis exposes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Coverability emerges as a practical complexity measure for designing privacy mechanisms in RL beyond linear settings.
The batching approach may extend to other privacy budgets or to model-based RL variants.
Clarifying the linear-case gaps could lead to tighter private analyses when function classes are restricted to linear.

Load-bearing premise

The function class must admit a finite coverability coefficient so the batched exponential mechanism can keep privacy noise from inflating the regret beyond the stated rate.

What would settle it

A concrete function class with finite coverability coefficient in which the observed regret under differential privacy grows faster than K to the 3/5 power.

Figures

Figures reproduced from arXiv: 2605.07049 by Xingyu Zhou, Yi He.

read the original abstract

We present the first theoretical guarantees for differentially private online reinforcement learning (RL) with general function approximation, extending beyond prior work restricted to tabular and linear settings. Our approach combines a batched policy update scheme with the exponential mechanism, together with a novel regret analysis. We show that, even under general function approximation, the regret in the model-free setting under differential privacy matches the state of the art for the linear case, scaling as $\widetilde{O}(K^{3/5})$, where $K$ denotes the number of episodes. As an important by-product, we also establish the first regret bound for online RL with batch update that depends on the standard complexity measure of coverability, complementing existing results based on a newly introduced Eluder-Condition class. In addition, we uncover fundamental gaps in recent results for private RL with linear function approximation, thereby clarifying its landscape.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first DP regret bounds for general function approximation RL and matches the linear O(K^{3/5}) rate under an explicit coverability assumption.

read the letter

The main thing to know is that they extend differential privacy to online RL with general function approximation and keep the regret at the same order as the linear case. They do this by running batched policy updates through the exponential mechanism and then bounding the extra privacy noise via the coverability coefficient of the function class. The analysis also yields the first coverability-dependent regret bound for any batched RL setting, and it includes clean counter-examples showing where some earlier linear private RL claims fall short. That combination of results is the real contribution. The central argument holds up: once you accept finite coverability (already required for the non-private batched bound), the privacy term gets absorbed without forcing a worse exponent or extra restrictions. The counter-examples are self-contained and do not touch the positive theorem. The weakest point is that the batching schedule and privacy budget still need to be tuned together, which could matter for anyone trying to implement it, but that is a practical detail rather than a load-bearing flaw in the math. The paper is aimed at researchers who work on theoretical RL with function approximation or on privacy constraints in sequential decision making. Anyone who has followed the linear and tabular DP-RL literature will see the direct extension and the clarification of prior gaps. It is worth sending to a serious referee because the claims are substantive, the assumptions are stated up front, and the technical steps appear reproducible from the given construction.

Referee Report

0 major / 3 minor

Summary. The paper presents the first theoretical guarantees for differentially private online reinforcement learning with general function approximation. It combines batched policy updates with the exponential mechanism to achieve a model-free regret bound of Õ(K^{3/5}) under differential privacy, matching the linear-function-approximation state of the art. As a byproduct, it derives the first regret bound for batched online RL that depends on the standard coverability coefficient (rather than Eluder dimension), and it identifies fundamental gaps in recent private linear-FA RL results.

Significance. If the central claims hold, the work meaningfully extends differentially private RL beyond tabular and linear settings, which is a substantial advance given the prevalence of general function approximation in practice. The fact that the privacy-induced term can be absorbed into the existing Õ(K^{3/5}) rate without worsening the exponent is a strong positive result. The coverability-based batched-regret bound is a useful technical contribution that complements existing Eluder-based analyses, and the counter-examples clarifying gaps in prior linear-FA work are self-contained and clarifying.

minor comments (3)

§3 (or wherever the coverability coefficient is first defined): the precise dependence of the batch size and privacy parameter on the coverability coefficient should be stated explicitly in the main theorem statement rather than only in the proof sketch, to make the result immediately usable by readers.
The counter-example section (likely §5 or §6) would benefit from a short table summarizing which prior linear-FA claims are invalidated and by what construction; this would improve readability without lengthening the paper.
Notation: the symbol for the coverability coefficient is introduced late; a single-line definition in the preliminaries would help readers who skip directly to the main theorem.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our contributions, as well as for recommending minor revision. The referee correctly highlights the first DP regret bound of order Õ(K^{3/5}) under general function approximation and the coverability-based batched regret bound. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via coverability analysis

full rationale

The claimed regret bound is obtained by combining batched policy updates with the exponential mechanism and performing a novel regret analysis that absorbs the privacy-induced additive term into the leading O(K^{3/5}) rate under the explicit finite-coverability assumption. This is not a fitted input renamed as prediction, nor a self-definitional reduction; the coverability coefficient is an independent complexity measure that also yields the by-product non-private batch-update bound. No load-bearing self-citations, uniqueness theorems imported from the authors, or ansatzes smuggled via prior work appear in the derivation chain. The clarification of gaps in prior linear-FA results is likewise self-contained and does not affect the main positive theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed from abstract only; therefore the ledger is necessarily incomplete. The paper implicitly relies on standard RL and DP background results rather than introducing new free parameters or entities.

axioms (2)

domain assumption The environment is a Markov decision process whose value or policy functions belong to a class with finite coverability.
Invoked to obtain the stated regret scaling under general function approximation.
standard math The exponential mechanism can be applied to policy selection while preserving the required privacy parameter.
Standard tool from differential privacy literature used in the batched update scheme.

pith-pipeline@v0.9.0 · 5438 in / 1406 out tokens · 45210 ms · 2026-05-11T01:13:16.180440+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1]

Model-based reinforcement learning with value-targeted regression

[AJSWY20] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. Yang. “Model-based reinforcement learning with value-targeted regression”. In:International Conference on Machine Learning. PMLR. 2020, pp. 463–474. 11 [AOM17] M. G. Azar, I. Osband, and R. Munos. “Minimax regret bounds for reinforcement learning”. In:International conference on machine learning. ...

work page 2020
[2]

Improved algorithms for linear stochastic bandits

[APS11] Y . Abbasi-Yadkori, D. Pál, and C. Szepesvári. “Improved algorithms for linear stochastic bandits”. In:Advances in neural information processing systems24 (2011). [ASM08] A. Antos, C. Szepesvári, and R. Munos. “Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path”. In:Machine Lear...

work page 2011
[3]

Differentially private empirical risk minimization

arXiv: 2505.20268 [cs.LG] . URL:https://arxiv.org/abs/2505.20268. [CMS11] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. “Differentially private empirical risk minimization.” In:Journal of Machine Learning Research12.3 (2011). [CSS11] T.-H. H. Chan, E. Shi, and D. Song. “Private and continual release of statistics”. In:ACM Transactions on Information and...

work page arXiv 2011
[4]

Shuffle private linear contextual bandits

2022, pp. 6375–6383. [CZ22b] S. R. Chowdhury and X. Zhou. “Shuffle private linear contextual bandits”. In:arXiv preprint arXiv:2202.05567(2022). [DKLLMSW21] S. Du, S. Kakade, J. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. “Bilinear classes: A structural framework for provable generalization in RL”. In:International Conference on Machine Learning. PML...

work page arXiv 2022
[5]

Exposing Privacy Gaps: Membership Inference Attack on Prefer- ence Data for LLM Alignment.CoRR abs/2407.06443,

Springer. 2006, pp. 265–284. [DRV10] C. Dwork, G. N. Rothblum, and S. Vadhan. “Boosting and differential privacy”. In: 2010 IEEE 51st annual symposium on foundations of computer science. IEEE. 2010, pp. 51–60. [FKKYTB24] Q. Feng, S. R. Kasa, S. K. Kasa, H. Yun, C. H. Teo, and S. B. Bodapati. “Exposing privacy gaps: Membership inference attack on preferenc...

work page arXiv 2006
[6]

Private Matchings and Allocations

[HHRRW16] J. Hsu, Z. Huang, A. Roth, T. Roughgarden, and Z. S. Wu. “Private Matchings and Allocations”. In:SIAM Journal on Computing45.6 (Jan. 2016), pp. 1953–1984.ISSN: 1095-7111.URL:http://dx.doi.org/10.1137/15100271X. [HZZBGY20] Y . Han, Z. Zhou, Z. Zhou, J. Blanchet, P. W. Glynn, and Y . Ye. “Sequential batch learn- ing in finite-action linear context...

work page doi:10.1137/15100271x 2016
[7]

GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards

[LZZSPCYHM25] Y . Liu, H. Zhang, J. Zheng, Z. Sun, Z. Peng, T. Cong, Y . Yang, X. He, and Z. Ma. “GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards”. In:arXiv preprint arXiv:2511.14045(2025). [MT07] F. McSherry and K. Talwar. “Mechanism design via differential privacy”. In:48th Annual IEEE Sympos...

work page internal anchor Pith review arXiv 2025
[8]

Training language models to follow instructions with human feedback

2014, pp. 1466–1474. [OWJAWMZASR+22] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. “Training language models to follow instructions with human feedback”. In:Advances in neural information processing systems35 (2022), pp. 27730–27744. [PWZLYS19] X. Pan, W. Wang, X. Zhang, B. Li, J. Yi, and...

work page 2014
[9]

Logarithmic switching cost in reinforcement learning beyond linear mdps.arXiv preprint arXiv:2302.12456,

2019, pp. 368–376. [QW23] D. Qiao and Y .-X. Wang. “Near-optimal differentially private reinforcement learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 9914–9940. [QYW23] D. Qiao, M. Yin, and Y .-X. Wang. “Logarithmic switching cost in reinforcement learning beyond linear mdps”. In:arXiv preprint arXiv:2302....

work page arXiv 2019
[10]

Towards Optimal Differentially Private Regret Bounds in Linear MDPs

2013, pp. 2256–2264. [Sah25] S. Sahu. “Towards Optimal Differentially Private Regret Bounds in Linear MDPs”. In:arXiv preprint arXiv:2504.09339(2025). [SJKAL19] W. Sun, N. Jiang, A. Krishnamurthy, A. Agarwal, and J. Langford. “Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches”. In:Conferenc...

work page arXiv 2013
[11]

Differentially private contextual linear bandits

[SS18] R. Shariff and O. Sheffet. “Differentially private contextual linear bandits”. In: Advances in Neural Information Processing Systems. 2018, pp. 4296–4306. [VBKW20] G. Vietri, B. Balle, A. Krishnamurthy, and S. Wu. “Private reinforcement learning with pac and regret guarantees”. In:International Conference on Machine Learning. PMLR. 2020, pp. 9754–9...

work page 2018
[12]

Provably efficient reinforcement learning with linear function approximation under adaptivity constraints

[WZG21] T. Wang, D. Zhou, and Q. Gu. “Provably efficient reinforcement learning with linear function approximation under adaptivity constraints”. In:Advances in Neural Information Processing Systems34 (2021), pp. 13524–13536. 14 [XFBJK22] T. Xie, D. J. Foster, Y . Bai, N. Jiang, and S. M. Kakade.The Role of Coverage in Online Reinforcement Learning

work page 2021
[13]

A general framework for sequential decision- making under adaptivity constraints

arXiv: 2210.04157 [cs.LG] .URL: https://arxiv.org/abs/2210.04157. [XWY23] N. Xiong, Z. Wang, and Z. Yang. “A general framework for sequential decision- making under adaptivity constraints”. In:arXiv preprint arXiv:2306.14468(2023). [YLNY21] C. Yu, J. Liu, S. Nemati, and G. Yin. “Reinforcement learning in healthcare: A survey”. In:ACM Computing Surveys (CS...

work page arXiv 2023
[14]

Differentially private reinforcement learning with linear function approxi- mation

arXiv: 2201 . 07052 [cs.LG].URL: https : / / arxiv . org/abs/2201.07052. [Zho22b] X. Zhou. “Differentially private reinforcement learning with linear function approxi- mation”. In:Proceedings of the ACM on Measurement and Analysis of Computing Systems6.1 (2022), pp. 1–27. [ZLKB20] A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill. “Learning near o...

work page arXiv 2022