WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, 2024a

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset: 2 · method: 1

representative citing papers

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

cs.LG · 2026-03-06 · conditional · novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy by up to 19.6% at 50% sparsity.

SLR: Automated Synthesis for Scalable Logical Reasoning

cs.AI · 2025-06-18 · unverdicted · novelty 7.0

SLR automates the creation of inductive logic tasks and a 20-level curriculum benchmark for training and evaluating LLM reasoning, doubling accuracy for Llama-3-8B and generalizing to other benchmarks.

Open-World Evaluations for Measuring Frontier AI Capabilities

cs.AI · 2026-05-19 · conditional · novelty 6.0

Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.

General Preference Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0 · 3 refs

GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.

Submodular Benchmark Selection

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Submodular maximization under a Gaussian model selects small benchmark subsets that outperform random selection for imputing leaderboard scores, with mutual information performing better than entropy at small subset sizes.

Ministral 3

cs.CL · 2026-01-13 · unverdicted · novelty 4.0

Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.
