Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification

Haoyang Hong; Huazheng Wang; Quanquan Gu; Zichen Wang

arxiv: 2606.06053 · v1 · pith:WKDAXMUYnew · submitted 2026-06-04 · 💻 cs.LG

Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification

Haoyang Hong , Zichen Wang , Quanquan Gu , Huazheng Wang This is my paper

Pith reviewed 2026-06-28 02:27 UTC · model grok-4.3

classification 💻 cs.LG

keywords KL-regularized RLmodel misspecificationfunction approximationregret boundscontextual banditsepisodic RLGibbs policy updateshigh-probability bounds

0 comments

The pith

KL-regularized RL and bandits achieve high-probability regret bounds with explicit misspecification terms under function approximation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies KL-regularized contextual bandits and episodic reinforcement learning when the assumed model class does not perfectly contain the true environment. It defines KL misspecification measures for these settings and analyzes regression-based algorithms that use Gibbs policy updates. The resulting bounds are high-probability guarantees on KL-regret that contain additive terms measuring the size of the misspecification and that reduce exactly to the realizable-case bounds when misspecification vanishes.

Core claim

The authors establish high-probability KL-regret guarantees for regression-based algorithms with Gibbs updates in both contextual bandits and episodic RL under general function approximation, where the bounds explicitly include misspecification terms defined via KL divergence, and these bounds recover the realizable case as a special instance when misspecification is zero.

What carries the argument

KL misspecification formulations for contextual bandits and episodic RL that quantify deviation from realizability via KL divergence and enable regression-based analysis to produce explicit additive terms in the regret bounds.

If this is right

The same regression-plus-Gibbs algorithm works for both bandits and episodic RL with only the misspecification term changing between the two settings.
When the misspecification term is zero the bounds coincide with prior realizable KL-regularized guarantees.
The analysis holds with high probability and applies to general function classes rather than tabular or linear settings.
Explicit dependence on the misspecification level makes the degradation in performance quantifiable rather than catastrophic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be used to certify robustness of deployed KL-regularized agents by measuring empirical KL misspecification on held-out trajectories.
One could design adaptive variants that estimate the misspecification term online and adjust regularization strength accordingly.
Similar misspecification measures might extend to other regularizers or to offline RL settings where data is collected under a different policy.

Load-bearing premise

The KL misspecification can be defined and bounded in a way that allows the regression-based analysis with Gibbs updates to produce explicit additive terms in the regret bound for both bandits and episodic RL.

What would settle it

An instance of contextual bandits or episodic RL where the defined KL misspecification measure is small yet the observed KL-regret exceeds the stated bound by more than the additive misspecification term.

read the original abstract

We study KL-regularized contextual bandits and episodic reinforcement learning (RL) under general function approximation with model misspecification. Existing guarantees rely on realizability and therefore do not extend to misspecified models, where classical regret bounds may fail. This work introduces KL misspecification formulations for contextual bandits and episodic RL and analyzes regression-based algorithms with Gibbs policy updates. High-probability KL-regret guarantees with explicit misspecification terms are established, recovering the standard realizable KL-regularized setting as a special case.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper defines KL misspecification for bandits and episodic RL and derives explicit high-probability regret bounds that recover the realizable case.

read the letter

The main thing to know is that the authors introduce KL misspecification measures for contextual bandits and episodic RL under general function approximation, then analyze regression-based algorithms with Gibbs policy updates to obtain high-probability KL-regret bounds containing explicit additive terms for the misspecification level. The bounds reduce exactly to the standard realizable setting when that level is zero.

The work extends the literature by moving past realizability assumptions that most prior regret analyses require. The misspecification formulations are set up to stay compatible with the regression oracles and policy updates, which lets the additive terms appear without breaking the existing proof structure. The stress-test note confirms the definitions are supplied explicitly, the derivations go through directly, and no circularity or hidden assumptions surface in the argument.

The central claim holds up on the evidence given. The paper sticks to the theoretical guarantees and does not overclaim practicality.

Soft spots are limited. How readily the misspecification measure can be bounded or estimated in applications is left open, but that is a natural next question rather than a flaw in the current derivations. The high-probability nature of the bounds is a clear positive.

This paper is for researchers working on theoretical RL and bandits with function approximation who need to handle misspecification. Readers focused on regret analysis under relaxed assumptions will find the explicit terms and the clean recovery of the realizable case useful.

It deserves peer review. The derivations support the claims and the result addresses a gap relative to the realizability-focused literature.

Referee Report

0 major / 3 minor

Summary. The paper studies KL-regularized contextual bandits and episodic RL under general function approximation with model misspecification. It introduces explicit KL misspecification measures for both settings, analyzes regression-oracle-based algorithms that employ Gibbs policy updates, and derives high-probability KL-regret bounds containing additive misspecification terms; the realizable case is recovered exactly when the misspecification parameter is zero.

Significance. If the stated derivations hold, the work supplies the first explicit high-probability regret guarantees for KL-regularized RL under misspecification, a setting that is practically relevant because realizability rarely holds exactly. The recovery of the standard realizable bounds as a special case and the compatibility with regression oracles are concrete strengths that allow direct comparison with prior realizable analyses.

minor comments (3)

The abstract and introduction should state the precise form of the regression oracle (e.g., whether it returns a least-squares or log-loss minimizer) and the exact definition of the KL misspecification measure used in the bounds, as these are central to the claimed compatibility with Gibbs updates.
Notation for the misspecification parameter (denoted variously as ε or δ in the abstract) should be unified and its dependence on the function class made explicit in the main theorem statements.
The manuscript would benefit from a short table comparing the new misspecification-dependent terms with the corresponding realizable bounds from prior work, to make the additive penalty transparent.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work on KL misspecification under function approximation and for recommending minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines KL misspecification measures explicitly for contextual bandits and episodic RL, then derives high-probability regret bounds for regression-oracle algorithms with Gibbs updates that include additive misspecification terms. The realizable case is recovered exactly by setting the misspecification parameter to zero. This is a direct, non-reductive generalization of prior realizable analyses; no step reduces by construction to a fitted input, self-definition, or load-bearing self-citation chain. The provided abstract and reader summary confirm an independent derivation path without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on the ability to formulate misspecification via KL terms and on the applicability of regression-based analysis to produce explicit bounds; no free parameters or invented entities are visible from the abstract.

axioms (2)

domain assumption General function approximation is compatible with regression oracles for value estimation
Abstract invokes regression-based algorithms under general function approximation.
domain assumption KL-regularized policy updates can be analyzed via standard concentration arguments once misspecification is defined
The high-probability bounds are claimed to follow from the new misspecification formulation.

pith-pipeline@v0.9.1-grok · 5611 in / 1334 out tokens · 41125 ms · 2026-06-28T02:27:34.762423+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 1 canonical work pages

[2]

Dylan Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert E

URL https://arxiv.org/abs/2405.19320. Dylan Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert E. Schapire. Practical contextual bandits with regression oracles. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research,

arXiv
[3]

URLhttps: //proceedings.mlr.press/v119/foster20a.html. Dylan J. Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits.arXiv preprint arXiv:2107.05745, 2021b. URLhttps://arxiv.org/ abs/2107.05745. Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning wit...

arXiv
[4]

Sanath Kumar Krishnamurthy, Vitor Hadad, and Susan Athey

URLhttps://arxiv.org/abs/2102.00815. Sanath Kumar Krishnamurthy, Vitor Hadad, and Susan Athey. Adapting to misspecification in contextual bandits with offline regression oracles. InProceedings of the 38th International Con- ference on Machine Learning, volume 139 ofProceedings of Machine Learning Research,

arXiv
[6]

Orin Levy and Yishay Mansour

URL https://arxiv.org/abs/2602.23116. Orin Levy and Yishay Mansour. Optimal regret for policy optimization in contextual bandits.arXiv preprint arXiv:2602.13700,

Pith/arXiv arXiv
[7]

Orin Levy, Liad Erez, Alon Peled-Cohen, and Yishay Mansour

URLhttps://arxiv.org/abs/2602.13700. Orin Levy, Liad Erez, Alon Peled-Cohen, and Yishay Mansour. Regret bounds for adversarial contextual bandits with general function approximation and delayed feedback.arXiv preprint arXiv:2510.09127,

arXiv
[8]

Yunfan Li and Lin Yang

URLhttps://arxiv.org/abs/2510.09127. Yunfan Li and Lin Yang. On the model-misspecification in reinforcement learning. InProceed- ings of the 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research,

arXiv
[10]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

URL https://arxiv.org/abs/2312.00886. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to fol- low instructions with human feedback. InAdvances in Neural Information Processing Systems, volume 35, pp. 27730–27744,

arXiv
[11]

Hao Qin and Chicheng Zhang

URLhttps:// arxiv.org/abs/2111.04850. Hao Qin and Chicheng Zhang. Taming the monster every context: Complexity measure and unified framework for offline-oracle efficient contextual bandits.arXiv preprint arXiv:2602.09456,

arXiv
[12]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D

URLhttps://arxiv.org/abs/2602.09456. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, volume 36, pp. 53728–53741,

arXiv
[14]

org/abs/1707.06347

URLhttps://arxiv. org/abs/1707.06347. Ayano Takemura, Shinji Ito, and Junya Honda. A parameter-free algorithm for misspecified linear contextual bandits. InProceedings of the 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research,

Pith/arXiv arXiv
[15]

Zihan Wang, Boyi Liu, and Chi Jin

URLhttps://arxiv.org/ abs/2005.10804. Zihan Wang, Boyi Liu, and Chi Jin. Is RLHF more difficult than standard RL? a theoretical analysis and an empirical study.arXiv preprint arXiv:2306.14111,

arXiv 2005
[16]

Di Wu, Chengshuai Shi, Jing Yang, and Cong Shen

URLhttps://arxiv.org/ abs/2306.14111. Di Wu, Chengshuai Shi, Jing Yang, and Cong Shen. Greedy sampling is provably efficient for RLHF. arXiv preprint arXiv:2510.24700,

arXiv
[17]

Tianhao Wu and Wen Sun

URLhttps://arxiv.org/abs/2510.24700. Tianhao Wu and Wen Sun. Making reinforcement learning from human feedback efficient via randomization.arXiv preprint arXiv:2310.14554,

arXiv
[18]

Tengyang Xie, Dylan J

URLhttps://arxiv.org/abs/ 2310.14554. Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, and Sham M. Kakade. The role of coverage in online reinforcement learning. InInternational Conference on Learning Representations,

arXiv
[19]

13 Tengyang Xie, Dylan J

URLhttps://arxiv.org/abs/2210.04157. 13 Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicitQ ⋆-approximation for sample-efficient RLHF.arXiv preprint arXiv:2405.21046,

arXiv
[20]

org/abs/2405.21046

URLhttps://arxiv. org/abs/2405.21046. Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning ...

arXiv
[21]

Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, and Tong Zhang

URLhttps://arxiv.org/abs/2006.08910. Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, and Tong Zhang. Online iterative reinforcement learning from human feedback with general preference model.arXiv preprint arXiv:2402.07314,

arXiv 2006
[22]

Tong Zhang.Mathematical Analysis of Machine Learning Algorithms

URLhttps://arxiv.org/abs/2402.07314. Tong Zhang.Mathematical Analysis of Machine Learning Algorithms. Cambridge University Press, Cambridge,

arXiv
[23]

Heyang Zhao, Chenlu Ye, Quanquan Gu, and Tong Zhang

DOI: 10.1017/9781009093057. Heyang Zhao, Chenlu Ye, Quanquan Gu, and Tong Zhang. Sharp analysis for KL-regularized con- textual bandits and RLHF.arXiv preprint arXiv:2411.04625,

work page doi:10.1017/9781009093057
[24]

org/abs/2411.04625

URLhttps://arxiv. org/abs/2411.04625. Heyang Zhao, Chenlu Ye, Wei Xiong, Quanquan Gu, and Tong Zhang. Logarithmic regret for online KL-regularized reinforcement learning.arXiv preprint arXiv:2502.07460,

arXiv
[25]

14 Supplementary Materials The following content was not necessarily subject to peer review

URL https://arxiv.org/abs/2502.07460. 14 Supplementary Materials The following content was not necessarily subject to peer review. A Notation Symbol Meaning T, HNumber of rounds/episodes and episodic horizon length. δTarget failure probability in high-probability guarantees. ηKL-regularization parameter. λ, βLocalization parameter and optimism-bonus scale...

arXiv

[1] [2]

Dylan Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert E

URL https://arxiv.org/abs/2405.19320. Dylan Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert E. Schapire. Practical contextual bandits with regression oracles. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research,

arXiv

[2] [3]

URLhttps: //proceedings.mlr.press/v119/foster20a.html. Dylan J. Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits.arXiv preprint arXiv:2107.05745, 2021b. URLhttps://arxiv.org/ abs/2107.05745. Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning wit...

arXiv

[3] [4]

Sanath Kumar Krishnamurthy, Vitor Hadad, and Susan Athey

URLhttps://arxiv.org/abs/2102.00815. Sanath Kumar Krishnamurthy, Vitor Hadad, and Susan Athey. Adapting to misspecification in contextual bandits with offline regression oracles. InProceedings of the 38th International Con- ference on Machine Learning, volume 139 ofProceedings of Machine Learning Research,

arXiv

[4] [6]

Orin Levy and Yishay Mansour

URL https://arxiv.org/abs/2602.23116. Orin Levy and Yishay Mansour. Optimal regret for policy optimization in contextual bandits.arXiv preprint arXiv:2602.13700,

Pith/arXiv arXiv

[5] [7]

Orin Levy, Liad Erez, Alon Peled-Cohen, and Yishay Mansour

URLhttps://arxiv.org/abs/2602.13700. Orin Levy, Liad Erez, Alon Peled-Cohen, and Yishay Mansour. Regret bounds for adversarial contextual bandits with general function approximation and delayed feedback.arXiv preprint arXiv:2510.09127,

arXiv

[6] [8]

Yunfan Li and Lin Yang

URLhttps://arxiv.org/abs/2510.09127. Yunfan Li and Lin Yang. On the model-misspecification in reinforcement learning. InProceed- ings of the 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research,

arXiv

[7] [10]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

URL https://arxiv.org/abs/2312.00886. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to fol- low instructions with human feedback. InAdvances in Neural Information Processing Systems, volume 35, pp. 27730–27744,

arXiv

[8] [11]

Hao Qin and Chicheng Zhang

URLhttps:// arxiv.org/abs/2111.04850. Hao Qin and Chicheng Zhang. Taming the monster every context: Complexity measure and unified framework for offline-oracle efficient contextual bandits.arXiv preprint arXiv:2602.09456,

arXiv

[9] [12]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D

URLhttps://arxiv.org/abs/2602.09456. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, volume 36, pp. 53728–53741,

arXiv

[10] [14]

org/abs/1707.06347

URLhttps://arxiv. org/abs/1707.06347. Ayano Takemura, Shinji Ito, and Junya Honda. A parameter-free algorithm for misspecified linear contextual bandits. InProceedings of the 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research,

Pith/arXiv arXiv

[11] [15]

Zihan Wang, Boyi Liu, and Chi Jin

URLhttps://arxiv.org/ abs/2005.10804. Zihan Wang, Boyi Liu, and Chi Jin. Is RLHF more difficult than standard RL? a theoretical analysis and an empirical study.arXiv preprint arXiv:2306.14111,

arXiv 2005

[12] [16]

Di Wu, Chengshuai Shi, Jing Yang, and Cong Shen

URLhttps://arxiv.org/ abs/2306.14111. Di Wu, Chengshuai Shi, Jing Yang, and Cong Shen. Greedy sampling is provably efficient for RLHF. arXiv preprint arXiv:2510.24700,

arXiv

[13] [17]

Tianhao Wu and Wen Sun

URLhttps://arxiv.org/abs/2510.24700. Tianhao Wu and Wen Sun. Making reinforcement learning from human feedback efficient via randomization.arXiv preprint arXiv:2310.14554,

arXiv

[14] [18]

Tengyang Xie, Dylan J

URLhttps://arxiv.org/abs/ 2310.14554. Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, and Sham M. Kakade. The role of coverage in online reinforcement learning. InInternational Conference on Learning Representations,

arXiv

[15] [19]

13 Tengyang Xie, Dylan J

URLhttps://arxiv.org/abs/2210.04157. 13 Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicitQ ⋆-approximation for sample-efficient RLHF.arXiv preprint arXiv:2405.21046,

arXiv

[16] [20]

org/abs/2405.21046

URLhttps://arxiv. org/abs/2405.21046. Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning ...

arXiv

[17] [21]

Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, and Tong Zhang

URLhttps://arxiv.org/abs/2006.08910. Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, and Tong Zhang. Online iterative reinforcement learning from human feedback with general preference model.arXiv preprint arXiv:2402.07314,

arXiv 2006

[18] [22]

Tong Zhang.Mathematical Analysis of Machine Learning Algorithms

URLhttps://arxiv.org/abs/2402.07314. Tong Zhang.Mathematical Analysis of Machine Learning Algorithms. Cambridge University Press, Cambridge,

arXiv

[19] [23]

Heyang Zhao, Chenlu Ye, Quanquan Gu, and Tong Zhang

DOI: 10.1017/9781009093057. Heyang Zhao, Chenlu Ye, Quanquan Gu, and Tong Zhang. Sharp analysis for KL-regularized con- textual bandits and RLHF.arXiv preprint arXiv:2411.04625,

work page doi:10.1017/9781009093057

[20] [24]

org/abs/2411.04625

URLhttps://arxiv. org/abs/2411.04625. Heyang Zhao, Chenlu Ye, Wei Xiong, Quanquan Gu, and Tong Zhang. Logarithmic regret for online KL-regularized reinforcement learning.arXiv preprint arXiv:2502.07460,

arXiv

[21] [25]

14 Supplementary Materials The following content was not necessarily subject to peer review

URL https://arxiv.org/abs/2502.07460. 14 Supplementary Materials The following content was not necessarily subject to peer review. A Notation Symbol Meaning T, HNumber of rounds/episodes and episodic horizon length. δTarget failure probability in high-probability guarantees. ηKL-regularization parameter. λ, βLocalization parameter and optimism-bonus scale...

arXiv