Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
13 Pith papers cite this work.
roles
background 4
citing papers explorer
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
  AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
- Code Review Agent Benchmark
  c-CRAB benchmark shows state-of-the-art code review agents solve only around 40% of tasks derived from human reviews, suggesting potential for human-AI collaboration.
- NUTSHELL: A Dataset for Abstract Generation from Scientific Talks
  NUTSHELL is a new open dataset of ACL talks paired with abstracts, accompanied by baselines that demonstrate training benefits for speech-to-abstract generation while highlighting remaining challenges.
- Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
  Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
- Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls
  Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
- Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
  CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.
- Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
  Derives backward and forward corrections for asymmetric verifier noise that improve RLVR performance on math reasoning tasks.
- WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
  WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.
- Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop
  LLMs judging sustainable city-trip recommendations show model-specific biases and high variance across four dimensions even when overall rankings agree, with a three-phase human calibration process clarifying some reasoning but exposing disagreements on sustainability.
- A Survey on LLM-as-a-Judge
  A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
- From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
  A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
- Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
  A parallel compliance architecture using multi-stage LLM retrieval improves correctness and reasoning quality over a baseline for OT cybersecurity compliance queries in a railway case study.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.