Proposes the LDB-DF and NDB-DF algorithms for contextual dueling bandits with delayed feedback, using an inverse probability weighting (IPW) estimator in the loss; achieves O(d sqrt(T)) regret in the linear case and sub-linear regret guarantees in the neural case.
Neural dueling bandits
2 Pith papers cite this work. Polarity classification is still indexing.