ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

Ellie Evans; Jaehun Jung; Jan Kautz; Jiaqi Zeng; Pavlo Molchanov; Shizhe Diao; Ximing Lu; Yejin Choi; Yi Dong; Zhilin Wang

arxiv: 2510.18941 · v2 · pith:S6QZIOVAnew · submitted 2025-10-21 · 💻 cs.CL · cs.AI · cs.LG

classification 💻 cs.CL · cs.AI · cs.LG

keywords profbench · https · llms · models · professional · evaluating · huggingface · knowledge

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench, Code: https://github.com/NVlabs/ProfBench, Leaderboard: https://huggingface.co/spaces/nvidia/ProfBench

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
cs.AI · 2026-05 · unverdicted · novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechan...

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
cs.AI · 2026-05 · unverdicted · novelty 7.0

New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.

Visual Preference Optimization with Rubric Rewards
cs.CV · 2026-04 · unverdicted · novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

Reward Hacking in Rubric-Based Reinforcement Learning
cs.AI · 2026-05 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...

BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
cs.AI · 2026-04 · unverdicted · novelty 6.0

BankerToolBench is a new open benchmark of end-to-end investment banking workflows developed with 502 bankers; even the best tested model (GPT-5.4) fails nearly half the expert rubric criteria and produces zero client...