Examples of the first kind are the emerging family of algorithms for reasoning such as GRPO Shao et al

In the context of LLM post-training, there are applications in which, at training time, it is possible to generate sentences from the model that is currently learning, getting feed · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

cs.LG · 2026-02-05 · unverdicted · novelty 5.0 · 2 refs

PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.

citing papers explorer

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

cs.LG · 2026-02-05 · unverdicted · none · ref 44 · 2 links