Aligning with human judgement: The role of pairwise preference in large language model evaluators

Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulić, Anna Korhonen, Nigel Collier · 2025 · arXiv 2403.16950

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.

GRASP: Deterministic argument ranking in interaction graphs

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging and no correlation with human convincingness.

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

cs.SE · 2026-01-25 · conditional · novelty 7.0

Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

Towards Spec Learning: Inference-Time Alignment from Preference Pairs

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

Proposes compiling preference pairs into readable natural-language specifications for inference-time LLM alignment, claiming outperformance over DPO on dense-preference domains.

Semantic Data Processing with Holistic Data Understanding

cs.DB · 2026-04-03 · unverdicted · novelty 6.0

HoldUp uses LLM-guided clustering to provide holistic dataset context for semantic operators, yielding up to 33% higher classification accuracy and 30% higher scoring accuracy than row-by-row LLM processing across 15 datasets.

Fragile Preferences: A Deep Dive Into Order Effects in Large Language Models

cs.AI · 2025-06-17 · unverdicted · novelty 6.0

LLMs exhibit quality-dependent order biases and name biases in pairwise comparisons that can cause selection of inferior options, demonstrated across resume and color tasks with a new classification of preferences as robust, fragile, or indifferent.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Causal Connections: Leveraging Multilingual Fine-Tuning for Financial QA@FinCausal 2026

cs.CL · 2026-06-25 · unverdicted · novelty 2.0

Fine-tuned multilingual LLMs achieve top shared-task scores on financial causality extraction in English and Spanish.

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

cs.AI · 2026-04-25

citing papers explorer

Showing 1 of 1 citing paper after filters.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · none · ref 154

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
