Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Relative entropy policy search , author=

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

representative citing papers

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

cs.AI · 2026-05-20 · conditional · novelty 7.0

DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.

An adaptive variance estimator for relative sparsity

stat.ME · 2026-05-04 · unverdicted · novelty 6.0

A new adaptive variance estimator for relative sparsity coefficients is introduced that fully utilizes the prior asymptotic normality theorem and incorporates variable selection effects.

citing papers explorer

Showing 4 of 4 citing papers.

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment cs.AI · 2026-05-20 · conditional · none · ref 17
DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
Discrete Flow Matching for Offline-to-Online Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 56
DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex cs.LG · 2026-05-07 · unverdicted · none · ref 47 · 2 links
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
An adaptive variance estimator for relative sparsity stat.ME · 2026-05-04 · unverdicted · none · ref 120
A new adaptive variance estimator for relative sparsity coefficients is introduced that fully utilizes the prior asymptotic normality theorem and incorporates variable selection effects.

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

fields

years

verdicts

representative citing papers

citing papers explorer