Arena-Hard Auto: Evaluating LLMs with Human-in-the-loop Standards.
https://lmsys.org/blog/2024-04-19-arena-hard/

Tianle Li, Wei-Lin Chiang, Evan Frick, et al. · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

cs.AI · 2026-03-27 · unverdicted · novelty 6.0

XpertBench provides 1,346 rubric-scored expert tasks showing leading LLMs achieve a maximum ~66% success rate and ~55% mean score across domains.

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation · cs.AI · 2026-03-27 · unverdicted · none · ref 17