arxiv: 2406.11939 · v2 · pith:KWEAOFPEnew · submitted 2024-06-17 · 💻 cs.LG · cs.AI· cs.CL

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Tianle Li , Wei-Lin Chiang , Evan Frick , Lisa Dunlap , Tianhao Wu , Banghua Zhu , Joseph E. Gonzalez , Ion Stoica This is my paper

classification 💻 cs.LG cs.AIcs.CL

keywords benchbuilderbenchmarkbenchmarkschallengingcurationhigh-qualityhumanmodel

0 comments

read the original abstract

The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce BenchBuilder, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without human in the loop. We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and ability to separate models. We release Arena-Hard-Auto, a benchmark consisting 500 challenging prompts curated by BenchBuilder. Arena-Hard-Auto provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings, all at a cost of $20. Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evaluating Large Language Models in Scientific Discovery
cs.AI 2025-12 unverdicted novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Green Shielding: A User-Centric Approach Towards Trustworthy AI
cs.CL 2026-04 unverdicted novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
cs.CL 2026-04 conditional novelty 7.0

SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
cs.CL 2026-05 unverdicted novelty 6.0

LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning ga...
Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
cs.AI 2026-05 unverdicted novelty 6.0

Deployment-relevant AI alignment cannot be inferred from model-level evaluations alone, as benchmark audits show missing interaction support and cross-model tests reveal model-dependent scaffold effects.
Hybrid Policy Distillation for LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...
Process Reinforcement through Implicit Rewards
cs.LG 2025-02 conditional novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
Qwen2.5-1M Technical Report
cs.CL 2025-01 accept novelty 6.0

Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
cs.AI 2026-04 unverdicted novelty 5.0

An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
cs.SE 2026-04 unverdicted novelty 5.0

LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
Kimi K2: Open Agentic Intelligence
cs.LG 2025-07 unverdicted novelty 5.0

Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
Qwen3 Technical Report
cs.CL 2025-05 unverdicted novelty 5.0

Pith review generated a malformed one-line summary.
Submodular Benchmark Selection
cs.AI 2026-05 unverdicted novelty 4.0

Submodular maximization under a Gaussian model selects small benchmark subsets that outperform random selection for imputing leaderboard scores, with mutual information better than entropy at small sizes.
Ministral 3
cs.CL 2026-01 unverdicted novelty 4.0

Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.
Phi-4-reasoning Technical Report
cs.AI 2025-04 unverdicted novelty 4.0

A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related...
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
cs.LG 2026-05 unverdicted novelty 3.0

Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
cs.CL 2026-05 unverdicted novelty 3.0

EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.