Larger differences in generator capability between chosen and rejected reasoning traces improve out-of-domain performance, while filtering pairs by sample-level quality deltas enables more data-efficient training.
URL https://proceedings.neurips.cc/paper_files/paper/2024/file/2c487f8a54cf24c0684c32abc77fed56-Paper-Conference.pdf
1 Pith paper cites this work.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?