Optimal Gap-Dependent Regret for Private Stochastic Decision-Theoretic Online Learning

Roberto Colomboni; Tommaso Cesari

arxiv: 2605.29148 · v1 · pith:ZP3P5BNFnew · submitted 2026-05-27 · 💻 cs.LG · stat.ML

Optimal Gap-Dependent Regret for Private Stochastic Decision-Theoretic Online Learning

Tommaso Cesari , Roberto Colomboni This is my paper

Pith reviewed 2026-06-29 13:21 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords differential privacyonline learningregret boundsgap-dependent regretstochastic decision-theoretic learningfull information feedbackpure DPevent-level privacy

0 comments

The pith

A horizon-free pure-DP algorithm achieves gap-dependent regret bounded by 1000 times (log K over Delta_min plus log K over epsilon) for all time horizons T in stochastic decision-theoretic online learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the optimal gap-dependent regret under pure event-level differential privacy matches a known lower bound of order log K over min(Delta_min, epsilon). It does so by constructing an algorithm whose total regret never exceeds 1000 times that quantity, independent of the horizon length. A sympathetic reader cares because the construction resolves an explicit open question on the precise rate achievable with privacy constraints. The bound separates the contribution from the action gap and from the privacy parameter into two additive terms, each multiplied by log K.

Core claim

We study stochastic decision-theoretic online learning with full information and event-level pure differential privacy. For K actions, losses in [0,1], and a unique best action separated from the second-best by gap Delta_min, the known lower bound is of order log K over min(Delta_min, epsilon). We give a horizon-free pure-DP algorithm and prove the explicit regret bound Reg_T <= 1000 * (log K / Delta_min + log K / epsilon) for every horizon T. The algorithm partitions time into blocks of exponentially increasing size, plays a single action throughout each block, and chooses the next action by an exponential mechanism applied to a data-independent random prefix of the previous block. The rand

What carries the argument

Block partitioning with exponentially growing sizes combined with an exponential mechanism on a data-independent random prefix of the prior block; the prefix turns each block's regret into a controllable sum of softmax errors whose total is bounded by an entropy potential.

If this is right

The achieved regret is independent of the time horizon T.
The privacy cost appears only as the additive term log K / epsilon and does not interact with the gap term.
The same block structure yields a pure-DP guarantee while preserving the non-private gap-dependent rate up to the constant factor.
The construction works for any sequence of loss vectors satisfying the unique-gap assumption.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The random-prefix technique may extend to other settings where one wants to decouple privacy noise from gap-dependent selection.
If the constant 1000 can be reduced, the same algorithmic skeleton would immediately improve the explicit bound.
The entropy-potential argument might apply to other full-information private online problems that admit a softmax error decomposition.

Load-bearing premise

The analysis assumes a unique best action separated from the second-best by a fixed gap Delta_min, losses bounded in [0,1], and that the random-prefix exponential mechanism converts block regret into a sum of softmax errors that an entropy potential can control.

What would settle it

An explicit instance with unique gap Delta_min where the algorithm's realized regret exceeds 1000 times (log K / Delta_min + log K / epsilon) on some run of length T, or a matching lower bound construction showing a strictly smaller universal constant is impossible.

read the original abstract

We study stochastic decision-theoretic online learning with full information and event-level pure differential privacy. A COLT open problem of Hu and Mehta asks to determine the optimal gap-dependent regret rate for stochastic decision-theoretic online learning under pure event-level differential privacy. For $K$ actions, losses in $[0,1]$, and a unique best action separated from the second-best action by gap $\Delta_{\min}$, the known lower bound is of order $ \frac{\log K}{\min\{\Delta_{\min},\varepsilon\}}, $ or equivalently, up to universal constants, of order \[ \frac{\log K}{\Delta_{\min}}+\frac{\log K}{\varepsilon}. \] We give a horizon-free pure-DP algorithm and prove the explicit regret bound \[ \operatorname{Reg}_T \le 1000 \cdot \left(\frac{\log K}{\Delta_{\min}}+\frac{\log K}{\varepsilon}\right) \] for every horizon $T$. The numerical constant is not optimized. The algorithm partitions time into blocks of exponentially increasing size, plays a single action throughout each block, and chooses the next action by an exponential mechanism applied to a data-independent random prefix of the previous block. The random prefix converts block regret into a sum, over all prefix lengths, of softmax selection errors. A single entropy-potential argument controls all privacy-dominated large-gap actions at cost $\log K/\varepsilon$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper closes the COLT open problem with the first explicit horizon-free pure-DP algorithm matching the gap-dependent lower bound order.

read the letter

The paper supplies the first algorithm and matching upper bound for the Hu-Mehta open question on gap-dependent regret under pure event-level DP. For K actions and unique best arm with gap Delta_min, it achieves Reg_T <= 1000 (log K / Delta_min + log K / epsilon) for any T, using exponentially growing blocks, a single action per block, and the exponential mechanism on a random prefix of the prior block.

The construction is new: earlier work had the lower bound but left the horizon-free upper bound open. The random-prefix trick converts per-block regret into a sum of softmax errors, and one entropy-potential argument is claimed to bound the privacy-dominated terms at total cost log K / epsilon without T factors. That is the central technical step, and the abstract states it works via a single potential.

The bound is explicit and the constant, while large, is consistent with an order-optimal result that was not previously available. The analysis assumes losses in [0,1] and a unique best action separated by Delta_min; those are standard but worth noting.

The main soft spot is verification of the potential argument itself. The stress-test concern about possible log T accumulation from summing over prefixes across Theta(log T) blocks is reasonable to raise from the abstract alone. The paper asserts the argument telescopes or resets to avoid that, but without the full derivation of the potential update and charging, it is hard to confirm the bound stays T-independent. The large constant also leaves room for tightening, though that is secondary.

This is for researchers in private online learning who care about tight gap-dependent rates. A reader working on DP regret bounds or COLT-style open problems gets direct value from the construction. It deserves serious referee time because it claims to resolve a stated open problem with a concrete algorithm and proof sketch.

Referee Report

1 major / 1 minor

Summary. The paper claims to resolve the COLT open problem of Hu and Mehta by giving a horizon-free pure event-level DP algorithm for stochastic decision-theoretic online learning. For K actions and losses in [0,1] with unique best action separated by gap Δ_min, it proves the explicit T-independent regret bound Reg_T ≤ 1000 ⋅ (log K / Δ_min + log K / ε) that matches the known lower bound up to the universal constant 1000. The algorithm partitions time into exponentially growing blocks, plays a single action per block, and selects the next action via the exponential mechanism applied to a data-independent random prefix of the prior block; a single entropy-potential argument is asserted to bound the total contribution of privacy-dominated actions by O(log K / ε).

Significance. If the central analysis holds, the result is significant: it supplies the first explicit horizon-free pure-DP algorithm achieving the optimal gap-dependent rate, directly answering the open question. The block-partitioning construction together with the random-prefix reduction and entropy-potential argument are technically notable for eliminating all log T factors while preserving pure DP.

major comments (1)

[Abstract] Abstract (final paragraph describing the algorithm): the claim that 'a single entropy-potential argument controls all privacy-dominated large-gap actions at cost log K/ε' is load-bearing for the T-independent bound. With Θ(log T) blocks of exponentially increasing size, the random-prefix construction produces a sum over an increasing number of prefix lengths; without an explicit definition of the potential function, its update rule across blocks, and the telescoping argument that prevents accumulation of log T factors, it is impossible to verify that the total privacy cost remains O(log K / ε) rather than O(log T ⋅ log K / ε).

minor comments (1)

[Abstract] The constant 1000 is explicitly stated as unoptimized; a short remark on whether the analysis yields a concrete (even if large) improvement path would aid readers.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thorough review and positive assessment of the significance of our work. We address the major comment regarding the entropy-potential argument below. We agree that additional details are needed for clarity and will revise the manuscript to include them.

read point-by-point responses

Referee: [Abstract] Abstract (final paragraph describing the algorithm): the claim that 'a single entropy-potential argument controls all privacy-dominated large-gap actions at cost log K/ε' is load-bearing for the T-independent bound. With Θ(log T) blocks of exponentially increasing size, the random-prefix construction produces a sum over an increasing number of prefix lengths; without an explicit definition of the potential function, its update rule across blocks, and the telescoping argument that prevents accumulation of log T factors, it is impossible to verify that the total privacy cost remains O(log K / ε) rather than O(log T ⋅ log K / ε).

Authors: We acknowledge that the abstract's description is high-level and that the full details of the entropy-potential argument are crucial for verifying the bound. In the revised version of the paper, we will expand the relevant section to explicitly define the entropy potential function, describe its update rule after each block, and provide the telescoping argument showing that the total contribution from privacy-dominated actions sums to O(log K / ε) independent of T. The argument relies on the fact that the random prefix ensures the selection error is averaged over prefixes, and the potential decreases sufficiently to offset the privacy cost without accumulation over the log T blocks due to the exponential block sizes and the uniform control over large-gap actions. revision: yes

Circularity Check

0 steps flagged

Block partitioning and single entropy-potential argument yield horizon-free bound with no self-referential reduction

full rationale

The paper derives the stated regret bound from an explicit block-partitioning construction (exponentially growing blocks) combined with an entropy-potential argument that is asserted to telescope the summed softmax errors over all prefix lengths and all blocks at total cost O(log K / ε). No equations are presented in which a fitted parameter is renamed as a prediction, a quantity is defined in terms of itself, or a load-bearing step reduces to a self-citation. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The result rests on standard definitions of pure DP and the usual stochastic gap assumption; the main addition is the algorithmic construction rather than new axioms or entities.

free parameters (1)

1000 = 1000
Universal constant chosen to absorb all factors in the entropy-potential and privacy analysis; not optimized.

axioms (2)

standard math Pure event-level differential privacy definition and properties of the exponential mechanism
Invoked to guarantee privacy while selecting actions from previous-block prefixes.
domain assumption Existence of a unique best action with fixed gap Δ_min and losses in [0,1]
Standard modeling assumption for gap-dependent stochastic regret analysis.

pith-pipeline@v0.9.1-grok · 5792 in / 1325 out tokens · 47197 ms · 2026-06-29T13:21:57.442227+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 2 canonical work pages

[1]

2013 , PAGES =

St´ ephane Boucheron, G´ abor Lugosi, and Pascal Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 02 2013. ISBN 9780199535255. doi: 10.1093/acprof:oso/9780199535255.001.0001. URLhttps://doi.org/ 10.1093/acprof:oso/9780199535255.001.0001

work page doi:10.1093/acprof:oso/9780199535255.001.0001 2013
[2]

Differential privacy

Cynthia Dwork. Differential privacy. automata, languages and programming, 2006

2006
[3]

A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

1997
[4]

Probability inequalities for sums of bounded random variables.Journal of the American statistical association, 58(301):13–30, 1963

Wassily Hoeffding. Probability inequalities for sums of bounded random variables.Journal of the American statistical association, 58(301):13–30, 1963

1963
[5]

Bingshan Hu and Nishant A. Mehta. Open problem: Optimal rates for stochastic decision- theoretic online learning under differentially privacy. In Shipra Agrawal and Aaron Roth, editors,Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 ofPro- ceedings of Machine Learning Research, pages 5330–5334. PMLR, 30 Jun–03 Jul 2024. URL https:/...

2024
[6]

Mehta, and Nidhi Hegde

Bingshan Hu, Zhiming Huang, Nishant A. Mehta, and Nidhi Hegde. Near-optimal algo- rithms for differentially private online learning in a stochastic environment.arXiv preprint arXiv:2102.07929, 2021

work page arXiv 2021
[7]

On minimaxity of follow the leader strategy in the stochastic setting

Wojciech Kot lowski. On minimaxity of follow the leader strategy in the stochastic setting. Theoretical Computer Science, 742:50–65, 2018

2018
[8]

Mechanism design via differential privacy

Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007

2007
[9]

On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

Jaouad Mourtada and St´ ephane Ga¨ ıffas. On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

2019
[10]

Improved regret in stochastic decision-theoretic online learning under differential privacy

Ruihan Wu and Yu-Xiang Wang. Improved regret in stochastic decision-theoretic online learning under differential privacy. In Matus Telgarsky and Jonathan Ullman, editors,Pro- ceedings of The 37th International Conference on Algorithmic Learning Theory, volume 313 ofProceedings of Machine Learning Research, pages 1–22. PMLR, 23–26 Feb 2026. URL https://pro...

2026

[1] [1]

2013 , PAGES =

St´ ephane Boucheron, G´ abor Lugosi, and Pascal Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 02 2013. ISBN 9780199535255. doi: 10.1093/acprof:oso/9780199535255.001.0001. URLhttps://doi.org/ 10.1093/acprof:oso/9780199535255.001.0001

work page doi:10.1093/acprof:oso/9780199535255.001.0001 2013

[2] [2]

Differential privacy

Cynthia Dwork. Differential privacy. automata, languages and programming, 2006

2006

[3] [3]

A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

1997

[4] [4]

Probability inequalities for sums of bounded random variables.Journal of the American statistical association, 58(301):13–30, 1963

Wassily Hoeffding. Probability inequalities for sums of bounded random variables.Journal of the American statistical association, 58(301):13–30, 1963

1963

[5] [5]

Bingshan Hu and Nishant A. Mehta. Open problem: Optimal rates for stochastic decision- theoretic online learning under differentially privacy. In Shipra Agrawal and Aaron Roth, editors,Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 ofPro- ceedings of Machine Learning Research, pages 5330–5334. PMLR, 30 Jun–03 Jul 2024. URL https:/...

2024

[6] [6]

Mehta, and Nidhi Hegde

Bingshan Hu, Zhiming Huang, Nishant A. Mehta, and Nidhi Hegde. Near-optimal algo- rithms for differentially private online learning in a stochastic environment.arXiv preprint arXiv:2102.07929, 2021

work page arXiv 2021

[7] [7]

On minimaxity of follow the leader strategy in the stochastic setting

Wojciech Kot lowski. On minimaxity of follow the leader strategy in the stochastic setting. Theoretical Computer Science, 742:50–65, 2018

2018

[8] [8]

Mechanism design via differential privacy

Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007

2007

[9] [9]

On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

Jaouad Mourtada and St´ ephane Ga¨ ıffas. On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

2019

[10] [10]

Improved regret in stochastic decision-theoretic online learning under differential privacy

Ruihan Wu and Yu-Xiang Wang. Improved regret in stochastic decision-theoretic online learning under differential privacy. In Matus Telgarsky and Jonathan Ullman, editors,Pro- ceedings of The 37th International Conference on Algorithmic Learning Theory, volume 313 ofProceedings of Machine Learning Research, pages 1–22. PMLR, 23–26 Feb 2026. URL https://pro...

2026