Variance-aware regret bounds for stochastic contextual dueling bandits

Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, Quanquan Gu · 2023 · arXiv 2310.00968

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions

cs.LG · 2026-04-24 · unverdicted · novelty 8.0

SupSplitLog achieves Õ(√(dT)) regret for logistic bandits without context diversity assumptions by splitting samples for an initial estimator and Newton correction, and can adapt to data-dependent bounds.

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

cs.LG · 2025-05-25 · unverdicted · novelty 7.0

ActiveDPO is a theoretically grounded active data selection method for sample-efficient LLM alignment that parameterizes the reward model directly with the LLM being aligned.

citing papers explorer

Showing 2 of 2 citing papers.

Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions · cs.LG · 2026-04-24 · unverdicted · none · ref 4
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment · cs.LG · 2025-05-25 · unverdicted · none · ref 45
