TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.
Spread preference annotation: Direct preference judgment for efficient llm alignment
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
citing papers explorer
-
TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization
TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.
-
Generating Place-Based Compromises Between Two Points of View
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.