super hub Mixed citations

A Survey on LLM-as-a-Judge

Chengjin Xu, Hexiang Tan, Jiawei Gu, Xuehao Zhai, Xuhui Jiang, Zhichao Shi · 2024 · cs.CL · arXiv 2411.15594

Mixed citation behavior. Most common role is background (70%).

114 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 114 citing papers more from Chengjin Xu arXiv PDF

abstract

Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discussed practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 18 method 5

citation-polarity summary

background 16 use method 5 unclear 2

claims ledger

abstract Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of L

authors

Chengjin Xu Hexiang Tan Jiawei Gu Xuehao Zhai Xuhui Jiang Zhichao Shi

co-cited works

representative citing papers

FollowTable: A Benchmark for Instruction-Following Table Retrieval

cs.IR · 2026-05-01 · unverdicted · novelty 8.0

FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

cs.AI · 2026-04-20 · accept · novelty 8.0

MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmentation yielding up to 12% gains.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

cs.CL · 2025-07-28 · accept · novelty 8.0

MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.

GS-QA: A Benchmark for Geospatial Question Answering

cs.DB · 2026-05-21 · unverdicted · novelty 7.0

GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

cs.CR · 2026-05-19 · accept · novelty 7.0

Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

ConsumerSimBench evaluates 13 LLMs on reconstructing crowd reactions from 1,553 Chinese social-media topics using 23,122 auditable yes-no criteria, finding maximum coverage of 47.8% by Gemini-3.1-Pro.

Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

CBEA with LCV bounds evidence sets and validates commitments before response generation, achieving zero failures in scoped tests at 0.49-0.60 availability versus near-zero for baselines.

Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

cond-mat.stat-mech · 2026-05-11 · unverdicted · novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

cs.CY · 2026-05-11 · accept · novelty 7.0 · 2 refs

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

Task-Aware Calibration: Provably Optimal Decoding in LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

BIM Information Extraction Through LLM-based Adaptive Exploration

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

Depression patient simulators produce overly long, low-variability responses that resolve emotions too quickly along a uniform trajectory, with framework choice outweighing model scale.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA

cs.CL · 2026-04-24 · conditional · novelty 7.0

MuDABench provides 332 analytical QA instances over large semi-structured document collections, showing standard RAG performs poorly while a multi-agent workflow with planning, extraction, and code generation improves results but leaves a gap to human experts.

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

cs.AI · 2026-04-17 · conditional · novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.

LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

A controlled LLM pipeline generates synthetic French OSCE transcripts with varying skill levels and evaluates them, with mid-size models achieving ~90% accuracy matching GPT-4o on the synthetic data.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

cs.CL · 2026-03-27 · unverdicted · novelty 7.0

PR-CAD unifies text-to-CAD generation and editing via progressive refinement with LLMs, a new interaction dataset, and RL-enhanced reasoning to achieve better controllability and faithfulness.

When Negation Is a Geometry Problem in Vision-Language Models

cs.CV · 2026-03-20 · conditional · novelty 7.0

A direction associated with negation exists in CLIP embedding space and can be steered at test time via representation engineering to produce negation-aware outputs without fine-tuning.

Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

cs.CL · 2026-03-20 · conditional · novelty 7.0

Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.

citing papers explorer

Showing 50 of 114 citing papers.

FollowTable: A Benchmark for Instruction-Following Table Retrieval cs.IR · 2026-05-01 · unverdicted · none · ref 14 · internal anchor
FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval cs.AI · 2026-04-20 · accept · none · ref 48 · internal anchor
MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmentation yielding up to 12% gains.
GIANTS: Generative Insight Anticipation from Scientific Literature cs.CL · 2026-04-10 · unverdicted · none · ref 5 · internal anchor
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation cs.CL · 2025-07-28 · accept · none · ref 8 · internal anchor
MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.
GS-QA: A Benchmark for Geospatial Question Answering cs.DB · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.
Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025) cs.CR · 2026-05-19 · accept · none · ref 25 · internal anchor
Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.
Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench cs.CL · 2026-05-16 · unverdicted · none · ref 19 · internal anchor
ConsumerSimBench evaluates 13 LLMs on reconstructing crowd reactions from 1,553 Chinese social-media topics using 23,122 auditable yes-no criteria, finding maximum coverage of 47.8% by Gemini-3.1-Pro.
Recall Isn't Enough: Bounding Commitments in Personalized Language Systems cs.AI · 2026-05-15 · unverdicted · none · ref 3 · internal anchor
CBEA with LCV bounds evidence sets and validates commitments before response generation, achieving zero failures in scoped tests at 0.49-0.60 availability versus near-zero for baselines.
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics cond-mat.stat-mech · 2026-05-11 · unverdicted · none · ref 20 · internal anchor
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs cs.CY · 2026-05-11 · accept · none · ref 48 · 2 links · internal anchor
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
Task-Aware Calibration: Provably Optimal Decoding in LLMs cs.LG · 2026-05-11 · unverdicted · none · ref 13 · internal anchor
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
BIM Information Extraction Through LLM-based Adaptive Exploration cs.CL · 2026-05-03 · unverdicted · none · ref 50 · internal anchor
LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation cs.SE · 2026-04-29 · unverdicted · none · ref 11 · internal anchor
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators cs.CL · 2026-04-28 · unverdicted · none · ref 4 · internal anchor
Depression patient simulators produce overly long, low-variability responses that resolve emotions too quickly along a uniform trajectory, with framework choice outweighing model scale.
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models cs.CV · 2026-04-27 · unverdicted · none · ref 14 · internal anchor
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA cs.CL · 2026-04-24 · conditional · none · ref 2 · internal anchor
MuDABench provides 332 analytical QA instances over large semi-structured document collections, showing standard RAG performs poorly while a multi-agent workflow with planning, extraction, and code generation improves results but leaves a gap to human experts.
Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring cs.CL · 2026-04-20 · unverdicted · none · ref 47 · internal anchor
LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge cs.CL · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench cs.AI · 2026-04-17 · conditional · none · ref 9 · internal anchor
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs cs.CL · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
A controlled LLM pipeline generates synthetic French OSCE transcripts with varying skill levels and evaluates them, with mid-size models achieving ~90% accuracy matching GPT-4o on the synthetic data.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning cs.LG · 2026-04-08 · unverdicted · none · ref 27 · internal anchor
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models cs.CL · 2026-03-27 · unverdicted · none · ref 5 · internal anchor
PR-CAD unifies text-to-CAD generation and editing via progressive refinement with LLMs, a new interaction dataset, and RL-enhanced reasoning to achieve better controllability and faithfulness.
When Negation Is a Geometry Problem in Vision-Language Models cs.CV · 2026-03-20 · conditional · none · ref 4 · internal anchor
A direction associated with negation exists in CLIP embedding space and can be steered at test time via representation engineering to produce negation-aware outputs without fine-tuning.
Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis cs.CL · 2026-03-20 · conditional · none · ref 27 · internal anchor
Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents cs.LG · 2026-03-13 · unverdicted · none · ref 9 · internal anchor
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling cs.SE · 2026-01-21 · unverdicted · none · ref 23 · internal anchor
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
VIDEOP2R: Video Understanding from Perception to Reasoning cs.CV · 2025-11-14 · conditional · none · ref 16 · internal anchor
VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models cs.AI · 2025-10-24 · unverdicted · none · ref 4 · internal anchor
Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs cs.CL · 2025-10-10 · unverdicted · none · ref 8 · internal anchor
FinAuditing is a taxonomy-structured multi-document benchmark with 1,102 instances averaging over 33k tokens from XBRL filings, defining three tasks to evaluate LLMs on financial auditing capabilities.
Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models cs.CL · 2025-07-05 · conditional · none · ref 8 · internal anchor
Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.
Bayesian Social Deduction with Graph-Informed Language Models cs.AI · 2025-06-21 · unverdicted · none · ref 13 · internal anchor
Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.
Reinforcement Learning for Reasoning in Large Language Models with One Training Example cs.LG · 2025-04-29 · accept · none · ref 61 · internal anchor
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR cs.CV · 2025-04-15 · conditional · none · ref 18 · internal anchor
Consensus Entropy measures inter-VLM output agreement to verify OCR reliability and enable self-improving ensembles, yielding 42.1% F1 gains over single-model judging.
Membership Inference Attacks for Retrieval Based In-Context Learning for Document Question Answering cs.CR · 2026-05-05 · unverdicted · none · ref 19
Black-box membership inference attacks on retrieval-based in-context learning for document QA succeed via query prefixes, with a novel weighted-averaging method outperforming priors even under paraphrasing.
Towards Context-Invariant Safety Alignment for Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 80 · internal anchor
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 62 · internal anchor
Evaluation of 6233 MedGPTs finds 25-30% with low factual accuracy, 33.6-54.3% violating operational thresholds, and 57% of action-enabled models lacking privacy disclosures.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents cs.CL · 2026-05-14 · unverdicted · none · ref 269 · internal anchor
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines cs.SE · 2026-05-14 · accept · none · ref 45 · internal anchor
A paraphrase-robust clustering pipeline plus XGBoost classifier identifies refactoring-worthy step subsequences in large BDD test corpora with out-of-fold F1 0.891, outperforming rule baselines and LLM judges.
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems cs.AI · 2026-05-14 · unverdicted · none · ref 27 · internal anchor
HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle cs.SE · 2026-05-13 · unverdicted · none · ref 18 · internal anchor
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs cs.CV · 2026-05-12 · accept · none · ref 4 · internal anchor
A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement cs.AI · 2026-05-11 · unverdicted · none · ref 6 · internal anchor
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability cs.LG · 2026-05-09 · unverdicted · none · ref 73 · internal anchor
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel cs.SE · 2026-05-08 · conditional · none · ref 21 · internal anchor
False-positive bug reports in the Linux kernel consume effort comparable to real bugs and can be filtered by LLMs using retrieval-augmented generation at 88% F1.
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models cs.CV · 2026-05-07 · unverdicted · none · ref 8 · internal anchor
DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
A Survey on LLM-based Conversational User Simulation cs.CL · 2026-04-27 · unverdicted · none · ref 3 · internal anchor
A survey that introduces a taxonomy for LLM-based conversational user simulation, analyzes core techniques and evaluation methods, and identifies open challenges in the field.
Exploring Audio Hallucination in Egocentric Video Understanding cs.CV · 2026-04-26 · unverdicted · none · ref 21 · internal anchor
AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models cs.MM · 2026-04-25 · unverdicted · none · ref 48 · internal anchor
OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.
Evian: Towards Explainable Visual Instruction-tuning Data Auditing cs.CV · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.
Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models cs.AI · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.

A Survey on LLM-as-a-Judge

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer