Yourbench: Easy custom evaluation sets for everyone

Sumuk Shashidhar, Cl´ ementine Fourrier, Alina Lozovskia, Thomas Wolf, Gokhan Tur, Dilek Hakkani-T¨ ur · 2025 · arXiv 2504.01833

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

cs.AI · 2026-04-27 · unverdicted · novelty 5.0

STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

cs.AI · 2026-04-06 · unverdicted · novelty 5.0

A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.

citing papers explorer

Showing 2 of 2 citing papers.

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator cs.AI · 2026-04-27 · unverdicted · none · ref 27
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis cs.AI · 2026-04-06 · unverdicted · none · ref 11
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.

Yourbench: Easy custom evaluation sets for everyone

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer