Feel-good Thompson sampling for contextual dueling bandits. arXiv preprint arXiv:2404.06013

Xuheng Li, Heyang Zhao, Quanquan Gu · 2024 · arXiv 2404.06013

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Robust Linear Dueling Bandits with Post-serving Context under Unknown Delays and Adversarial Corruptions

cs.LG · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

Presents a robust algorithm for linear dueling bandits that achieves delay-regime-agnostic regret of order d(sqrt(T) + C + D) with additive costs for corruption and delay under post-serving contexts.

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

cs.LG · 2025-05-25 · unverdicted · novelty 7.0

ActiveDPO is a theoretically grounded active data selection method for sample-efficient LLM alignment that parameterizes the reward model directly with the LLM being aligned.

Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration

cs.LG · 2025-06-02 · unverdicted · novelty 6.0

Variance-aware neural dueling bandit algorithms achieve sublinear regret of order O(d sqrt(sum sigma_t^2) + sqrt(d T)) for wide networks on nonlinear utilities.

citing papers explorer

Showing 3 of 3 citing papers.

Robust Linear Dueling Bandits with Post-serving Context under Unknown Delays and Adversarial Corruptions · cs.LG · 2026-05-03 · unverdicted · none · ref 5 · 2 links
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment · cs.LG · 2025-05-25 · unverdicted · none · ref 46
Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration · cs.LG · 2025-06-02 · unverdicted · none · ref 22
