CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.
Yourbench: Easy custom evaluation sets for everyone
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2026 3roles
background 1polarities
background 1representative citing papers
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.
citing papers explorer
No citing papers match the current filters.