Sequential Off-Policy Learning with Logarithmic Smoothing
Pith reviewed 2026-05-19 09:43 UTC · model grok-4.3
The pith
Sequential off-policy learning with logarithmic smoothing matches batch methods and outperforms them under repeated policy updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present and study a simple algorithm for sequential off-policy learning, combining Logarithmic Smoothing (LS) estimation with online PAC-Bayesian tools. We further show that a principled adjustment to LS improves performance and accelerates convergence under mild conditions. The algorithms introduced generalise previous work: they match state-of-the-art offline approaches in the batch case and substantially outperform them when policies are updated sequentially.
What carries the argument
Logarithmic Smoothing (LS) estimation paired with online PAC-Bayesian bounds, which together convert accumulating logged interactions into stable policy updates across sequential deployments.
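To make the LS ingredient concrete, here is a minimal sketch of a logarithmically smoothed importance-weighted estimate next to the plain IPS average. It assumes the form commonly given in the LS literature for costs in [-1, 0]; the paper's exact definition and its adjusted variant are not reproduced on this page, and the function names and toy data are hypothetical.

```python
import math

def ips_estimate(weights, costs):
    """Plain importance-weighted (IPS) average of logged costs."""
    return sum(w * c for w, c in zip(weights, costs)) / len(costs)

def ls_estimate(weights, costs, lam):
    """Logarithmic Smoothing estimate with smoothing parameter lam >= 0.

    Assumption: costs lie in [-1, 0] and weights are pi(a|x) / pi_0(a|x) >= 0,
    so 1 - lam * w * c >= 1 and the logarithm is always defined.
    """
    if lam == 0.0:
        return ips_estimate(weights, costs)  # LS reduces to IPS as lam -> 0
    total = sum(math.log1p(-lam * w * c) for w, c in zip(weights, costs))
    return -total / (lam * len(costs))

# Toy logged data: importance weights and non-positive costs (illustrative only).
weights = [0.5, 2.0, 8.0, 1.0]
costs = [-1.0, 0.0, -1.0, -0.5]
ips = ips_estimate(weights, costs)
ls = ls_estimate(weights, costs, lam=0.5)
```

Because log(1 + u) <= u, the LS value is never below the IPS value when costs are non-positive, so large importance weights are damped and the estimate is pessimistic about the policy's risk.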
If this is right
- The algorithms recover existing state-of-the-art offline performance when all data is presented at once.
- Substantial gains appear specifically when the same policy is updated repeatedly on growing datasets.
- The adjustment to logarithmic smoothing improves performance and accelerates convergence under mild conditions.
- The framework directly supports the real-world loop of training on past logs while collecting new data for the next round.
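The deploy, log, and retrain loop described in the points above can be sketched on a toy three-armed bandit. This is an illustration under stated assumptions, not the paper's algorithm: there are no contexts, the learner picks the candidate policy with the lowest LS risk (no PAC-Bayesian regularisation), and every name and constant is hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def collect(probs, n, success_prob, rng):
    """Deploy a policy for n interactions; log (action, propensity, cost)."""
    logs = []
    for _ in range(n):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        cost = -1.0 if rng.random() < success_prob[a] else 0.0
        logs.append((a, probs[a], cost))  # propensity of the policy deployed at that time
    return logs

def ls_risk(probs, logs, lam):
    """LS risk estimate of a candidate policy on the cumulative logs."""
    total = sum(math.log1p(-lam * (probs[a] / p) * c) for a, p, c in logs)
    return -total / (lam * len(logs))

def sequential_learning(candidates, success_prob, rounds, n, lam, seed=0):
    rng = random.Random(seed)
    k = len(success_prob)
    policy = [1.0 / k] * k   # first deployment: uniform behavior policy
    logs = []                # all data collected so far, across deployments
    for _ in range(rounds):
        logs += collect(policy, n, success_prob, rng)  # data from the current policy
        policy = min(candidates, key=lambda q: ls_risk(q, logs, lam))  # retrain on everything
    return policy, logs

# Candidate policies: each one favours a single arm.
candidates = [softmax([2.0 if i == j else 0.0 for i in range(3)]) for j in range(3)]
policy, logs = sequential_learning(candidates, [0.9, 0.1, 0.1], rounds=3, n=200, lam=0.1)
```

Each logged tuple keeps the propensity of whichever policy was deployed when it was collected, which is what lets later rounds reuse data generated under different behavior policies.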
Where Pith is reading between the lines
- The same smoothing-plus-PAC-Bayesian combination could be tested on non-stationary environments where the data distribution itself drifts between updates.
- Extending the adjustment rule to other variance-reduction techniques such as weighted importance sampling might produce further gains in high-variance settings.
- Because the method is designed for repeated redeployment, it suggests a natural path toward lifelong policy improvement without resetting the data buffer each time.
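For reference, the weighted (self-normalised) importance sampling estimator mentioned above divides by the sum of the weights instead of the sample size. The sketch below is the standard textbook form; it is not taken from the paper, and the function name is hypothetical.

```python
def weighted_is_estimate(weights, costs):
    """Self-normalised importance sampling: sum(w * c) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("all importance weights are zero")
    return sum(w * c for w, c in zip(weights, costs)) / total_weight

# Toy logged data (illustrative only).
estimate = weighted_is_estimate([0.5, 2.0, 8.0, 1.0], [-1.0, 0.0, -1.0, -0.5])
```

The result is a weighted average of the logged costs, so it always stays inside their range, at the price of a small bias.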
Load-bearing premise
A principled adjustment to logarithmic smoothing improves performance and accelerates convergence under mild conditions.
What would settle it
An experiment on standard benchmarks where the sequential version shows no improvement over its batch counterpart or where the adjustment to LS fails to accelerate convergence would falsify the central performance claim.
Original abstract
Off-policy learning enables training policies from logged interaction data. Most prior work considers the batch setting, where a policy is learned from data generated by a single behavior policy. In real systems, however, policies are updated and redeployed repeatedly, each time training on all previously collected data while generating new interactions for future updates. This sequential off-policy learning setting is common in practice but remains largely unexplored theoretically. In this work, we present and study a simple algorithm for sequential off-policy learning, combining Logarithmic Smoothing (LS) estimation with online PAC-Bayesian tools. We further show that a principled adjustment to LS improves performance and accelerates convergence under mild conditions. The algorithms introduced generalise previous work: they match state-of-the-art offline approaches in the batch case and substantially outperform them when policies are updated sequentially. Empirical evaluations highlight both the benefits of the sequential framework and the strength of the proposed algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents algorithms for sequential off-policy learning that combine logarithmic smoothing (LS) estimation with online PAC-Bayesian tools. It further introduces a principled adjustment to LS claimed to improve performance and accelerate convergence under mild conditions. The algorithms are asserted to generalize prior work by matching state-of-the-art offline methods in the batch case while substantially outperforming them under sequential policy updates, with empirical evaluations highlighting benefits of both the sequential framework and the proposed methods.
Significance. If the PAC-Bayesian generalization bounds hold under adaptive behavior policies, the work would meaningfully advance off-policy learning by addressing the practically relevant sequential setting. The empirical demonstration of outperformance when policies are iteratively updated provides concrete evidence of practical utility and could guide deployment in systems with repeated data collection and redeployment.
major comments (1)
- [Theoretical analysis (§4)] The claim of substantial outperformance in the sequential regime (Abstract and §5) rests on the online PAC-Bayesian analysis controlling cumulative deviation when each new batch is generated by the just-updated policy. Standard i.i.d. or slowly-varying assumptions are insufficient here; the proof must explicitly invoke a martingale concentration argument (e.g., Freedman's inequality) to handle the policy-induced dependence. If this step is missing or relies only on the usual batch-style bounds, the theoretical justification for sequential improvement is incomplete and the reported gains may not generalize beyond the specific experimental regime.
minor comments (2)
- [Abstract] The abstract states that the LS adjustment improves performance 'under mild conditions' but does not enumerate those conditions; a brief parenthetical or reference to the relevant theorem would improve clarity.
- [Experiments (§5)] Experimental details on how data from previous rounds are retained or filtered when forming the cumulative dataset would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The comment on the theoretical analysis is well taken, and we address it directly below.
Point-by-point responses
- Referee: [Theoretical analysis (§4)] The claim of substantial outperformance in the sequential regime (Abstract and §5) rests on the online PAC-Bayesian analysis controlling cumulative deviation when each new batch is generated by the just-updated policy. Standard i.i.d. or slowly-varying assumptions are insufficient here; the proof must explicitly invoke a martingale concentration argument (e.g., Freedman's inequality) to handle the policy-induced dependence. If this step is missing or relies only on the usual batch-style bounds, the theoretical justification for sequential improvement is incomplete and the reported gains may not generalize beyond the specific experimental regime.
Authors: We agree that the sequential setting introduces policy-induced dependence that requires careful handling beyond standard i.i.d. assumptions. Section 4 develops the analysis within an online PAC-Bayesian framework that is formulated precisely for adaptive, sequentially generated data. The proof applies a martingale concentration inequality to control the cumulative deviation across rounds where each new batch is produced by the updated policy. To improve clarity and explicitly address the referee's concern, we will revise the presentation in §4 to name Freedman's inequality, state the martingale difference sequence, and highlight how the bound applies under the adaptive behavior policy. This revision will not alter the results but will make the justification for the sequential outperformance more transparent.
Revision: yes
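For readers unfamiliar with the martingale argument under discussion, the following is a standard textbook statement of Freedman's inequality. It is not quoted from the paper; the symbols $X_i$, $\mathcal{F}_i$, $b$, $S_n$ and $V_n$ are notation introduced here for illustration, and how §4 instantiates them is not shown on this page.

```latex
% Freedman's inequality (standard form; notation is illustrative).
Let $(X_i)_{i \ge 1}$ be a martingale difference sequence with respect to a
filtration $(\mathcal{F}_i)_{i \ge 0}$, i.e.
$\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] = 0$, with $X_i \le b$ almost surely.
Define
\[
  S_n = \sum_{i=1}^{n} X_i ,
  \qquad
  V_n = \sum_{i=1}^{n} \mathbb{E}\!\left[ X_i^2 \mid \mathcal{F}_{i-1} \right].
\]
Then for all $t > 0$ and $v > 0$,
\[
  \mathbb{P}\!\left( \exists\, n \ge 1 :\; S_n \ge t \ \text{and}\ V_n \le v \right)
  \;\le\;
  \exp\!\left( - \frac{t^2}{2\,(v + b t / 3)} \right).
\]
```

One plausible reading for the sequential setting (an assumption, not the paper's stated construction) takes $X_i$ to be the centred estimator term for interaction $i$ and $\mathcal{F}_{i-1}$ to be generated by all earlier interactions, which determine the behavior policy in force at step $i$.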
Circularity Check
No significant circularity; derivation remains self-contained against external benchmarks
The paper presents algorithms that combine logarithmic smoothing estimation with online PAC-Bayesian tools and a principled adjustment to LS, claiming generalization of prior offline work to the sequential setting. No load-bearing step reduces by construction to a fitted parameter renamed as prediction, nor does any central result depend on a self-citation chain whose cited theorem is itself unverified within the paper. The batch-case matching to SOTA and sequential outperformance are positioned as empirical and theoretical consequences of the new combination rather than tautological re-derivations of inputs. The derivation chain therefore stays independent of the target claims.
Reference graph
Works this paper leans on
- [1] P. Alquier. User-friendly Introduction to PAC-Bayes Bounds. Foundations and Trends® in Machine Learning, 2024.
- [2] R. Andreeva, B. Dupuis, R. Sarkar, T. Birdal, and U. Şimşekli. Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [3] I. Aouali, V.-E. Brunel, D. Rohde, and A. Korba. Exponential Smoothing for Off-Policy Learning. In Proceedings of the 40th International Conference on Machine Learning, pages 984–1017. PMLR, 2023.
- [4] I. Aouali, V.-E. Brunel, D. Rohde, and A. Korba. Unified PAC-Bayesian study of pessimism for offline policy learning with regularized importance sampling. In Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, UAI ’24. JMLR.org, 2024.
- [5] A. Bibaut, N. Kallus, M. Dimakopoulou, A. Chambaz, and M. van der Laan. Risk minimization from adaptively collected data: guarantees for supervised and policy learning. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY, USA, 2021. Curran Associates Inc.
- [6]
- [7] O. Catoni. A PAC-Bayesian approach to adaptive classification. preprint, 840, 2003.
- [8] O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics, 2007.
- [9] B.-E. Chérief-Abdellatif, Y. Shi, A. Doucet, and B. Guedj. On PAC-Bayesian reconstruction guarantees for VAEs. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics [AISTATS], volume 151 of Proceedings of Machine Learning Research, pages 3066–3079. PMLR, 28–30 Mar 2022.
- [10] N. Chopin and O. Papaspiliopoulos. An Introduction to Sequential Monte Carlo. Springer Series in Statistics. Springer, Cham, Switzerland, 1st edition, 2020.
- [11]
- [12] I. Csiszár. I-Divergence Geometry of Probability Distributions and Minimization Problems. The Annals of Probability, 1975.
- [13] M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time—III. Communications on Pure and Applied Mathematics, 1976.
- [14] J. Doob. Jean Ville, Étude Critique de la Notion de Collectif. Bulletin of the American Mathematical Society, 45(11):824–824, 1939.
- [15]
- [16] G. K. Dziugaite and D. Roy. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
- [17] M. M. Fard and J. Pineau. PAC-Bayesian model selection for reinforcement learning. In Conference on Neural Information Processing Systems (NeurIPS), 2010.
- [18] G. Gabbianelli, G. Neu, and M. Papini. Importance-weighted offline learning done right. In Proceedings of The 35th International Conference on Algorithmic Learning Theory, volume 237 of Proceedings of Machine Learning Research, pages 614–634. PMLR, 25–28 Feb 2024.
- [19] B. Guedj. A Primer on PAC-Bayesian Learning. In Proceedings of the second congress of the French Mathematical Society, 2019.
- [20] M. Haddouche and B. Guedj. Online PAC-Bayes Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [21] M. Haddouche and B. Guedj. PAC-Bayes Generalisation Bounds for Heavy-Tailed Losses through Supermartingales. Transactions on Machine Learning Research, 2023.
- [22] M. Haddouche and B. Guedj. Wasserstein PAC-Bayes Learning: Exploiting Optimisation Guarantees to Explain Generalisation. 2023.
- [23] F. Hellström, G. Durisi, B. Guedj, and M. Raginsky. Generalization bounds: Perspectives from information theory and PAC-Bayes. arXiv preprint arXiv:2309.04381, 2023.
- [24] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
- [25] Y. Hu, N. Kallus, and X. Mao. Fast rates for contextual linear optimization. Manage. Sci., 68(6):4236–4245, June 2022.
- [26] Y. Jin, Z. Yang, and Z. Wang. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
- [27]
- [28] N. Kallus and A. Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243–1251. PMLR, 2018.
- [29] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [30] L. Li, B. Guedj, and S. Loustau. A quasi-Bayesian perspective to online clustering. Electron. J. Statist., 2018.
- [31] B. London and T. Sandler. Bayesian counterfactual risk minimization. In International Conference on Machine Learning, pages 4125–4133. PMLR, 2019.
- [32] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the eleventh annual conference on Computational Learning Theory, pages 230–234. ACM, 1998.
- [33] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the twelfth annual conference on Computational Learning Theory, pages 164–170. ACM, 1999.
- [34] D. A. McAllester. PAC-Bayesian Stochastic Model Selection. Machine Learning, 2003.
- [35] A. M. Metelli, A. Russo, and M. Restelli. Subgaussian and differentiable importance sampling for off-policy evaluation and learning. Advances in Neural Information Processing Systems, 34:8119–8132, 2021.
- [36] W. Mou, L. Wang, X. Zhai, and K. Zheng. Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints. In Conference On Learning Theory (COLT), 2018.
- [37]
- [38] A. B. Owen. Monte Carlo theory, methods and examples. https://artowen.su.domains/mc/, 2013.
- [39] M. Perez-Ortiz, O. Rivasplata, J. Shawe-Taylor, and C. Szepesvari. Tighter Risk Certificates for Neural Networks. Journal of Machine Learning Research, 2021.
- [40] A. Reisizadeh, F. Farnia, R. Pedarsani, and A. Jadbabaie. Robust Federated Learning: The Case of Affine Distribution Shifts. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- [41]
- [42]
- [43] O. Sakhi, S. Bonner, D. Rohde, and F. Vasile. BLOB: A Probabilistic Model for Recommendation that Combines Organic and Bandit Signals. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 783–793, New York, NY, USA, 2020. Association for Computing Machinery.
- [44]
- [45] J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the 10th annual conference on Computational Learning Theory, pages 2–9. ACM, 1997.
- [46] Y. Su, M. Dimakopoulou, A. Krishnamurthy, and M. Dudík. Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning, pages 9167–9176. PMLR, 2020.
- [47] A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. The Journal of Machine Learning Research, 16(1):1731–1755, 2015.
- [48]
- [49] R. Zhan, Z. Ren, S. Athey, and Z. Zhou. Policy learning with adaptively collected data, 2022.
A Limitations
This work develops theoretically grounded and practical learning approaches for the adaptive contextual bandit setting, where the decision-maker can dynamically adjust behavior policies to collect higher-quality data. Our theoretical analysis g...
-
[50]
Eπj " dθ(a|x) πj(a|x) − 1 2 c2 ## = 1 1 − λ Eπj
but also blurs the notion of prior and posterior distributions, now independent of the fundamental Bayes formula. This flexibility allowed PAC-Bayes to reach various sub-fields of learning theory: optimization dynamics of learning algorithms [36, 22, 2], reinforcement learning [17], online learning [30, 20], contrastive learning [37], generative models [9...
-
[51]
with a learning rate of 10^{-1} for 10 epochs. Once it is trained, we use an inverse temperature parameter α on its score to interpolate between a uniform policy (α = 0) and a trained policy (α = 1). Optimizing our learning objectives. In each optimization subroutine, we use Adam [29] with a learning rate of 10^{-3} for 10 epochs. The gradient of LIG policies is a ...
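The inverse-temperature interpolation described in this fragment can be sketched as a softmax over scaled scores. The excerpt does not give the exact parameterisation, so the softmax form, the function name, and the example scores below are assumptions made for illustration.

```python
import math

def tempered_policy(scores, alpha):
    """Action probabilities proportional to exp(alpha * score).

    Assumed form: alpha = 0 gives the uniform policy, alpha = 1 gives the
    softmax of the trained model's scores.
    """
    scaled = [alpha * s for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.5, -1.0]  # hypothetical scores from a trained model
uniform = tempered_policy(scores, alpha=0.0)
trained = tempered_policy(scores, alpha=1.0)
```

Intermediate values of α give behavior policies of varying quality, which is useful for controlling how informative the initial logged data is.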