hub

Judging the judges: Evaluating alignment and vulner- abilities in llms-as-judges

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes · 2024 · arXiv 2406.12624

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4

citation-polarity summary

background 3 unclear 1

representative citing papers

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

cs.AI · 2026-04-17 · conditional · novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.

Code Review Agent Benchmark

cs.SE · 2026-03-24 · unverdicted · novelty 7.0

c-CRAB benchmark shows state-of-the-art code review agents solve only around 40% of tasks derived from human reviews, suggesting potential for human-AI collaboration.

NUTSHELL: A Dataset for Abstract Generation from Scientific Talks

cs.CL · 2025-02-24 · unverdicted · novelty 7.0

NUTSHELL is a new open dataset of ACL talks paired with abstracts, accompanied by baselines that demonstrate training benefits for speech-to-abstract generation while highlighting remaining challenges.

Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

cs.CL · 2026-05-04 · unverdicted · novelty 6.0

Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

cs.CL · 2026-01-29 · unverdicted · novelty 6.0

CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

cs.LG · 2025-10-01 · unverdicted · novelty 6.0

Derives backward and forward corrections for asymmetric verifier noise that improve RLVR performance on math reasoning tasks.

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback

cs.CL · 2024-08-28 · unverdicted · novelty 5.0

WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.

Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

cs.AI · 2026-04-27 · unverdicted · novelty 4.0

LLMs judging sustainable city-trip recommendations show model-specific biases and high variance across four dimensions even when overall rankings agree, with a three-phase human calibration process clarifying some reasoning but exposing disagreements on sustainability.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

cs.SE · 2024-10-28 · unverdicted · novelty 4.0

A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.

Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy

cs.AI · 2025-04-18 · unverdicted · novelty 3.0

A parallel compliance architecture using multi-stage LLM retrieval improves correctness and reasoning quality over a baseline for OT cybersecurity compliance queries in a railway case study.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

citing papers explorer

Showing 13 of 13 citing papers.

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench cs.AI · 2026-04-17 · conditional · none · ref 22
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
Code Review Agent Benchmark cs.SE · 2026-03-24 · unverdicted · none · ref 22
c-CRAB benchmark shows state-of-the-art code review agents solve only around 40% of tasks derived from human reviews, suggesting potential for human-AI collaboration.
NUTSHELL: A Dataset for Abstract Generation from Scientific Talks cs.CL · 2025-02-24 · unverdicted · none · ref 6
NUTSHELL is a new open dataset of ACL talks paired with abstracts, accompanied by baselines that demonstrate training benefits for speech-to-abstract generation while highlighting remaining challenges.
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks cs.AI · 2026-05-11 · unverdicted · none · ref 27
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls cs.CL · 2026-05-04 · unverdicted · none · ref 54
Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation cs.CL · 2026-01-29 · unverdicted · none · ref 20
CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers cs.LG · 2025-10-01 · unverdicted · none · ref 28
Derives backward and forward corrections for asymmetric verifier noise that improve RLVR performance on math reasoning tasks.
WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback cs.CL · 2024-08-28 · unverdicted · none · ref 35
WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.
Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop cs.AI · 2026-04-27 · unverdicted · none · ref 9
LLMs judging sustainable city-trip recommendations show model-specific biases and high variance across four dimensions even when overall rankings agree, with a three-phase human calibration process clarifying some reasoning but exposing disagreements on sustainability.
A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 148
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap cs.SE · 2024-10-28 · unverdicted · none · ref 98
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy cs.AI · 2025-04-18 · unverdicted · none · ref 43
A parallel compliance architecture using multi-stage LLM retrieval improves correctness and reasoning quality over a baseline for OT cybersecurity compliance queries in a railway case study.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 223
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Judging the judges: Evaluating alignment and vulner- abilities in llms-as-judges

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer