Logarithmic regret for online KL-regularized reinforcement learning, 2025a

URL https://arxiv · arXiv 2502.07460

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Introduces KL misspecification for bandits and RL under function approximation and proves explicit KL-regret bounds for regression-based Gibbs algorithms that recover the realizable case.

Efficient Exploration for Iterative Nash Preference Optimization

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

An explicitly exploratory iterative NLHF method achieves O(sqrt(T)) regret for Nash equilibria under general preference models, removing the exponential KL dependence that plagues standard iterative approaches.

$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.

citing papers explorer

Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification

cs.LG · 2026-06-04 · unverdicted · none · ref 25

Introduces KL misspecification for bandits and RL under function approximation and proves explicit KL-regret bounds for regression-based Gibbs algorithms that recover the realizable case.

Efficient Exploration for Iterative Nash Preference Optimization

cs.LG · 2026-05-31 · unverdicted · none · ref 121

An explicitly exploratory iterative NLHF method achieves O(sqrt(T)) regret for Nash equilibria under general preference models, removing the exponential KL dependence that plagues standard iterative approaches.

$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

cs.LG · 2026-05-07 · unverdicted · none · ref 18

The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.
