AutoBencher: Towards Declarative Benchmark Construction
3 Pith papers cite this work.
Citing papers by year: 2026 (3)
citing papers explorer
- CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks
  CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.
- Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm
  Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
- Scalable Environments Drive Generalizable Agents
  Generalizable agents require environment scaling via diverse executable rule-sets, distinguished from trajectory and task scaling in a new taxonomy.