arxiv: 2510.18941 · v2 · pith:S6QZIOVAnew · submitted 2025-10-21 · 💻 cs.CL · cs.AI · cs.LG

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

classification 💻 cs.CL · cs.AI · cs.LG
keywords profbench · https · llms · models · professional · evaluating · huggingface · knowledge

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench; Code: https://github.com/NVlabs/ProfBench; Leaderboard: https://huggingface.co/spaces/nvidia/ProfBench

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

    cs.AI 2026-05 unverdicted novelty 7.0

    SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechan...

  2. Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

    cs.AI 2026-05 unverdicted novelty 7.0

    New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.

  3. Visual Preference Optimization with Rubric Rewards

    cs.CV 2026-04 unverdicted novelty 7.0

    rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

  4. Reward Hacking in Rubric-Based Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...

  5. BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

    cs.AI 2026-04 unverdicted novelty 6.0

    BankerToolBench is a new open benchmark of end-to-end investment banking workflows developed with 502 bankers; even the best tested model (GPT-5.4) fails nearly half the expert rubric criteria and produces zero client...