πθ is the policy, or the current model at every step

H DPO Training Details. In all DPO experiments, we use the loss function introduced in (Rafailov et al., 2023)
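For orientation, the DPO loss of Rafailov et al. (2023) for a single preference pair can be sketched as below. This is a minimal illustration, not the cited paper's training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for the example. The inputs are summed log-probabilities of the chosen and rejected responses under the policy πθ and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    All names and the default beta are illustrative assumptions.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, both log-ratios are zero and the loss is log 2. It falls below that value as the policy raises the chosen response's likelihood relative to the rejected one.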

1 Pith paper cites this work. Polarity classification is still being indexed.

representative citing papers

Iterative Finetuning is Mostly Idempotent

cs.AI · 2026-05-01 · unverdicted · novelty 6.0

Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.

Iterative Finetuning is Mostly Idempotent · cs.AI · 2026-05-01 · unverdicted · none · ref 19