arXiv preprint arXiv:2503.07453 , year=

· 2025 · arXiv 2503.07453

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Experiments indicate RL applied early in pre-training often matches full SFT-then-RL performance, targeted data composition outweighs scale for RL success, and averaging RL and SFT objectives outperforms sequential or single methods.

citing papers explorer

Showing 3 of 3 citing papers after filters.

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon cs.LG · 2026-06-29 · unverdicted · none · ref 16
Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability cs.LG · 2026-05-09 · unverdicted · none · ref 92
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training cs.LG · 2026-06-02 · unverdicted · none · ref 12
Experiments indicate RL applied early in pre-training often matches full SFT-then-RL performance, targeted data composition outweighs scale for RL success, and averaging RL and SFT objectives outperforms sequential or single methods.

arXiv preprint arXiv:2503.07453 , year=

fields

years

verdicts

representative citing papers

citing papers explorer