LLMBar: A benchmark for evaluating LLM evaluators on pairwise comparisons. arXiv preprint arXiv:2310.07641

Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen · 2023 · arXiv 2310.07641

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

cs.LG · 2024-10-05 · unverdicted · novelty 7.0

TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

cs.CL · 2025-07-08 · unverdicted · novelty 6.0

REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

LLM Evaluators Recognize and Favor Their Own Generations

cs.CL · 2024-04-15 · unverdicted · novelty 6.0

LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

cs.AI · 2026-04-25 · unverdicted · novelty 5.0

Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

cs.AI · 2024-10-24 · unverdicted · novelty 4.0

Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.

citing papers explorer

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
cs.CL · 2026-05-18 · unverdicted · none · ref 60
REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis
cs.LG · 2024-10-05 · unverdicted · none · ref 15
TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling
cs.CL · 2025-07-08 · unverdicted · none · ref 36
REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
cs.AI · 2024-08-13 · unverdicted · none · ref 172
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

LLM Evaluators Recognize and Favor Their Own Generations
cs.CL · 2024-04-15 · unverdicted · none · ref 34
LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
cs.AI · 2026-04-25 · unverdicted · none · ref 26
Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.

A Survey on LLM-as-a-Judge
cs.CL · 2024-11-23 · unverdicted · none · ref 202
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
cs.AI · 2024-10-24 · unverdicted · none · ref 26
Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.
