pith. sign in

LLMBar: A benchmark for evaluating LLM evaluators on pairwise comparisons.arXiv preprint arXiv:2310.07641

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

citation-role summary

background 1

citation-polarity summary

verdicts

UNVERDICTED 8

roles

background 1

polarities

background 1

representative citing papers

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

LLM Evaluators Recognize and Favor Their Own Generations

cs.CL · 2024-04-15 · unverdicted · novelty 6.0

LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

citing papers explorer

Showing 8 of 8 citing papers.