JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Lianghui Zhu, Xinggang Wang, Xinlong Wang · 2023 · arXiv 2310.17631

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval

cs.IR · 2026-04-26 · accept · novelty 7.0

Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.

Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

cs.CV · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.

Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

cs.CL · 2024-04-29 · conditional · novelty 7.0

A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Tree-of-Writing achieves 0.93 Pearson correlation with human judgments by using a tree-structured workflow to aggregate sub-feature scores, outperforming standard LLM-as-a-judge and overlap metrics on the new HowToBench.

AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

cs.AI · 2026-03-22 · conditional · novelty 6.0

AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downstream task success by 6.8-8.5%.

ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness

cs.DC · 2026-02-14 · unverdicted · novelty 6.0

ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.

Supporting System Testing with a Multi-Agent LLM-based Framework for Knowledge Graph Extraction: A Case Study with Ethernet Switch Systems

cs.SE · 2026-05-18 · conditional · novelty 5.0

A multi-agent LLM-based framework extracts knowledge graphs from 50 real Ethernet switch manuals with 0.97-0.99 correctness to enable downstream test case specification generation.

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

cs.SE · 2026-04-30 · unverdicted · novelty 5.0

LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality

cs.CL · 2026-04-08 · unverdicted · novelty 5.0

Weak LLM judges accept wrong answers more often when shown fluent reasoning chains, while strong judges use them partially but still get misled by high-quality-looking but flawed reasoning.

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

cs.CL · 2026-04-07 · unverdicted · novelty 5.0

MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choice affects scores.

Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

cs.CL · 2026-04-03 · unverdicted · novelty 5.0

Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.

Refining and Reusing Annotation Guidelines for LLM Annotation

cs.CL · 2026-05-20 · conditional · novelty 4.0

An iterative moderation framework refines and reuses annotation guidelines to improve LLM annotation accuracy on biomedical NER tasks across GPT, Gemini, and DeepSeek models.

FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking

cs.AI · 2026-05-06 · unverdicted · novelty 4.0

FinRAG-12B is a production-deployed 12B model for banking that grounds answers with citations, refuses unanswerable queries at a calibrated 12% rate, outperforms GPT-4.1 on grounding, and improves query resolution by 7.1 points across 40+ institutions at 20-50x lower cost.

LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection

cs.AI · 2026-04-07 · unverdicted · novelty 4.0

An LLM produces consistent categorical judgments and appropriate confidence declines when evaluating powerline segmentation quality under controlled visual degradations, suggesting it can serve as a reliable watchdog.

Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning

cs.CL · 2025-11-03 · unverdicted · novelty 4.0

Fine-tuning LLMs on multi-source synthetic data mitigates distribution collapse and self-preference bias while increasing output quality relative to single-source or human-only fine-tuning.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

citing papers explorer

Showing 17 of 17 citing papers.

Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval · cs.IR · 2026-04-26 · accept · none · ref 44
Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks · cs.CV · 2026-04-21 · unverdicted · none · ref 6 · 2 links
StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts · cs.CL · 2026-04-20 · unverdicted · none · ref 50
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models · cs.CL · 2024-04-29 · conditional · none · ref 54
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing · cs.CL · 2026-04-21 · unverdicted · none · ref 3
Tree-of-Writing achieves 0.93 Pearson correlation with human judgments by using a tree-structured workflow to aggregate sub-feature scores, outperforming standard LLM-as-a-judge and overlap metrics on the new HowToBench.
AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning · cs.AI · 2026-03-22 · conditional · none · ref 13
AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downstream task success by 6.8-8.5%.
ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness · cs.DC · 2026-02-14 · unverdicted · none · ref 23
ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.
Supporting System Testing with a Multi-Agent LLM-based Framework for Knowledge Graph Extraction: A Case Study with Ethernet Switch Systems · cs.SE · 2026-05-18 · conditional · none · ref 40
A multi-agent LLM-based framework extracts knowledge graphs from 50 real Ethernet switch manuals with 0.97-0.99 correctness to enable downstream test case specification generation.
LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding · cs.SE · 2026-04-30 · unverdicted · none · ref 25
LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality · cs.CL · 2026-04-08 · unverdicted · none · ref 5
Weak LLM judges accept wrong answers more often when shown fluent reasoning chains, while strong judges use them partially but still get misled by high-quality-looking but flawed reasoning.
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts · cs.CL · 2026-04-07 · unverdicted · none · ref 32
MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choice affects scores.
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts · cs.CL · 2026-04-03 · unverdicted · none · ref 50
Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.
Refining and Reusing Annotation Guidelines for LLM Annotation · cs.CL · 2026-05-20 · conditional · none · ref 16
An iterative moderation framework refines and reuses annotation guidelines to improve LLM annotation accuracy on biomedical NER tasks across GPT, Gemini, and DeepSeek models.
FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking · cs.AI · 2026-05-06 · unverdicted · none · ref 3
FinRAG-12B is a production-deployed 12B model for banking that grounds answers with citations, refuses unanswerable queries at a calibrated 12% rate, outperforms GPT-4.1 on grounding, and improves query resolution by 7.1 points across 40+ institutions at 20-50x lower cost.
LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection · cs.AI · 2026-04-07 · unverdicted · none · ref 4
An LLM produces consistent categorical judgments and appropriate confidence declines when evaluating powerline segmentation quality under controlled visual degradations, suggesting it can serve as a reliable watchdog.
Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning · cs.CL · 2025-11-03 · unverdicted · none · ref 8
Fine-tuning LLMs on multi-source synthetic data mitigates distribution collapse and self-preference bias while increasing output quality relative to single-source or human-only fine-tuning.
A Survey on LLM-as-a-Judge · cs.CL · 2024-11-23 · unverdicted · none · ref 229
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.