ProfBench is a new multi-domain benchmark that uses human-expert rubrics to judge LLM responses on professional tasks. Top models reach only 65.9% on it, and it provides inexpensive LLM judges that reduce evaluation cost by orders of magnitude.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge