With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.
Online iterative reinforce- ment learning from human feedback with general preference model
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
verdicts
UNVERDICTED 3representative citing papers
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
STILL-2 uses imitation of distilled long-form thoughts, multi-rollout exploration on difficult problems, and iterative self-improvement of the dataset to train reasoning models that reach competitive performance on three challenging benchmarks.
citing papers explorer
No citing papers match the current filters.