arXiv preprint arXiv:2602.04417 , year=

Lunjun Zhang, Jimmy Ba · 2026 · arXiv 2602.04417

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability cs.LG · 2026-05-09 · unverdicted · none · ref 113
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games cs.LG · 2026-06-22 · unverdicted · none · ref 31
EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation cs.CL · 2026-05-12 · unverdicted · none · ref 91 · 2 links
On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

arXiv preprint arXiv:2602.04417 , year=

fields

years

verdicts

representative citing papers

citing papers explorer