LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.
Aligning with human judgement: The role of pairwise preference in large language model evaluators
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 2
citation-polarity summary
polarities: background 2
representative citing papers
GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging but no correlation with human convincingness.
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
Proposes compiling preference pairs into readable natural-language specifications for inference-time LLM alignment, claiming outperformance over DPO on dense-preference domains.
HoldUp uses LLM-guided clustering to provide holistic dataset context for semantic operators, yielding up to 33% higher classification accuracy and 30% higher scoring accuracy than row-by-row LLM processing across 15 datasets.
LLMs exhibit quality-dependent order and name biases in pairwise comparisons that can lead them to select inferior options; this is demonstrated on resume and color tasks, and preferences are newly classified as robust, fragile, or indifferent.
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
Fine-tuned multilingual LLMs achieve top shared-task scores on financial causality extraction in English and Spanish.
citing papers explorer
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.