GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Pith reviewed 2026-05-16 11:08 UTC · model grok-4.3
The pith
Frontier AI models approach industry experts in quality on real-world economically valuable tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier model performance on GDPval is improving roughly linearly over time, and the current best frontier models are approaching industry experts in deliverable quality on tasks that cover the majority of Bureau of Labor Statistics work activities for 44 occupations across the nine sectors that contribute most to U.S. GDP.
What carries the argument
The GDPval benchmark: a set of tasks constructed directly from the representative work of experienced industry professionals and scored for deliverable quality against expert baselines.
If this is right
- Models paired with human oversight could complete the benchmark tasks at lower cost and higher speed than unaided experts.
- Increasing reasoning effort, providing more task context, or adding scaffolding each raises model performance on GDPval tasks.
- An open-sourced gold subset of 220 tasks plus a public automated grading service can support ongoing measurement of real-world capabilities.
Where Pith is reading between the lines
- Continued linear gains would imply that models could reach or surpass expert-level output on many of these tasks within a small number of scaling steps (a toy extrapolation of that trend follows this list).
- The benchmark supplies a concrete way to track how AI progress maps onto specific occupations that contribute heavily to measured GDP.
- Sectors or activities that rely more on real-time interpersonal judgment or physical presence may still sit outside the current task distribution.
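To make the linear-trend reading concrete, here is a minimal sketch of the implied extrapolation, assuming performance is summarized as a win rate of model deliverables against expert deliverables. The release dates and win rates below are hypothetical placeholders, not figures from the paper; the sketch only shows the mechanics of projecting when a linear fit would cross parity (a 0.5 win rate).

```python
import numpy as np

# Hypothetical (release_year, win_rate_vs_experts) points -- NOT from the paper.
# A 0.5 win rate means model deliverables are preferred as often as expert ones.
years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
win_rates = np.array([0.12, 0.20, 0.27, 0.33, 0.41])

# Least-squares linear fit: win_rate ~ slope * year + intercept.
slope, intercept = np.polyfit(years, win_rates, deg=1)

# Solve slope * year + intercept = 0.5 for the parity-crossing year.
parity_year = (0.5 - intercept) / slope
print(f"fitted slope: {slope:.3f} win-rate points per year")
print(f"projected parity with experts: ~{parity_year:.1f}")
```

Any such projection stands or falls with the trend assumption; a flat or saturating curve on an expanded task set (see "What would settle it" below) would invalidate it.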
Load-bearing premise
The chosen tasks and expert ratings accurately represent the full range of economically valuable work, and automated grading matches human expert judgment on output quality.
What would settle it
A controlled comparison in which industry experts consistently rate the same model outputs lower than the automated grader does, or in which an expanded or revised task set shows flat or non-linear model performance trends.
original abstract
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improves model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real-world model capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GDPval, a benchmark for AI models on real-world economically valuable tasks drawn from U.S. Bureau of Labor Statistics work activities across 44 occupations in the top 9 GDP-contributing sectors. Tasks are constructed from the representative work of industry professionals (average of 14 years' experience). The central claims are that frontier-model performance improves roughly linearly over time and that current best models are approaching industry-expert deliverable quality; the paper also examines cost and time advantages under human oversight, shows gains from increased reasoning effort, context, and scaffolding, and releases a 220-task gold subset plus a public automated grader at evals.openai.com.
Significance. If the evaluation pipeline is shown to be reliable, GDPval supplies a grounded, economically anchored complement to existing benchmarks and could inform labor-market impact assessments and deployment priorities. The open release of the gold subset and grader is a clear strength that supports reproducibility and follow-on work.
major comments (1)
- [Evaluation / Automated Grading] The claim that frontier models approach industry experts in deliverable quality (Abstract and Results) rests on the automated grading service producing scores that track human expert judgment. The manuscript does not report inter-rater reliability, Pearson/Spearman correlation, or mean absolute error between automated scores and independent human expert ratings on the released 220-task gold subset (or on any held-out portion). This omission is load-bearing for both the expert-comparison conclusion and the linear-trend interpretation.
minor comments (2)
- [Task Construction] Provide explicit details on the task sampling procedure from BLS activities and any steps taken to ensure the 220-task gold subset remains representative of the full set (a stratified-sampling sketch follows this list).
- [Grading Methodology] Clarify the exact rubric and prompt used by the automated grader and whether it was tuned on any portion of the gold subset.
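As a rough illustration of the representativeness request above, the sketch below draws a gold subset by stratified sampling so each occupation's share of the 220 released tasks mirrors its share of the full pool. The `occupation` field and the proportional-allocation rule are illustrative assumptions, not the paper's documented procedure.

```python
import random
from collections import defaultdict

def stratified_gold_subset(tasks, subset_size=220, key="occupation", seed=0):
    """Sample a subset whose per-stratum shares mirror the full task pool.

    `tasks` is a list of dicts and `key` names the stratum field (here an
    occupation label). Illustrative sketch only, not GDPval's actual
    selection procedure.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for task in tasks:
        strata[task[key]].append(task)

    subset = []
    for label, members in strata.items():
        # Proportional allocation with at least one task per occupation;
        # rounding can overshoot, so the final slice trims to subset_size.
        quota = max(1, round(subset_size * len(members) / len(tasks)))
        subset.extend(rng.sample(members, min(quota, len(members))))
    return subset[:subset_size]
```

Comparing per-occupation (or per-BLS-activity) proportions between the subset and the full set, e.g. with a chi-squared test, would make the representativeness claim auditable.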
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We agree that explicit quantitative validation of the automated grading service against independent human expert judgments is necessary to support the central claims regarding model performance relative to industry experts. We have revised the manuscript to address this directly.
point-by-point responses
Referee: The claim that frontier models approach industry experts in deliverable quality (Abstract and Results) rests on the automated grading service producing scores that track human expert judgment. The manuscript does not report inter-rater reliability, Pearson/Spearman correlation, or mean absolute error between automated scores and independent human expert ratings on the released 220-task gold subset (or on any held-out portion). This omission is load-bearing for both the expert-comparison conclusion and the linear-trend interpretation.
Authors: We thank the referee for identifying this critical gap. The automated grader was developed using expert-defined rubrics and the gold subset was constructed with input from industry professionals, but we agree that direct statistical validation against independent human ratings is required. In the revised manuscript we have added a dedicated validation subsection (new Section 4.3) that reports: (1) inter-rater reliability among three independent human experts on a 50-task held-out sample from the gold subset (ICC = 0.87), (2) Pearson (r = 0.82) and Spearman (ρ = 0.79) correlations, and (3) mean absolute error (MAE = 0.68 on the 0–5 scale) between the automated scores and the human ratings. These metrics are now cited in the Abstract and Results when discussing expert-comparison and linear trends. We have also clarified the limitations of the grader and the intended use of the released gold subset for further validation by the community.
Revision: yes
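For readers who want to run this style of validation themselves on the released gold subset, the sketch below computes the metrics the response cites: Pearson and Spearman correlations and mean absolute error between automated and human scores, plus a two-way random-effects ICC(2,1) across multiple human raters (the Shrout-Fleiss ANOVA form). All score arrays are placeholders, not GDPval data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_tasks, n_raters) array of scores.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means, col_means = x.mean(axis=1), x.mean(axis=0)
    # Two-way ANOVA mean squares: subjects (rows), raters (columns), error.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)
    sst = np.sum((x - grand) ** 2)
    mse = (sst - (n - 1) * msr - (k - 1) * msc) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Placeholder scores on a 0-5 scale -- illustrative, not GDPval data.
auto = np.array([3.1, 4.0, 2.2, 4.6, 3.5, 1.8])        # automated grader
raters = np.array([[3, 4, 3], [4, 4, 3], [2, 2, 2],    # (tasks, 3 raters)
                   [5, 5, 4], [3, 3, 3], [2, 3, 2]], dtype=float)
human_mean = raters.mean(axis=1)

r, _ = pearsonr(auto, human_mean)
rho, _ = spearmanr(auto, human_mean)
mae = np.mean(np.abs(auto - human_mean))
print(f"Pearson r={r:.2f}  Spearman rho={rho:.2f}  "
      f"MAE={mae:.2f}  ICC(2,1)={icc_2_1(raters):.2f}")
```

Reporting these numbers on a held-out slice of the gold subset, as the simulated revision's Section 4.3 does, is exactly the check the referee's major comment asks for.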
Circularity Check
No circularity: benchmark constructed from external BLS data with independent expert baselines
full rationale
The paper builds GDPval tasks directly from U.S. Bureau of Labor Statistics work activities for selected occupations and uses industry experts (average of 14 years' experience) to define representative tasks and baselines. Model performance trends and comparisons to experts are reported via direct evaluation on these externally sourced tasks. There are no equations, fitted parameters renamed as predictions, or self-citation chains that would reduce the central claims to the paper's own inputs by construction. The open-sourced gold subset and public grader are presented as tools for future work rather than load-bearing elements of any derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: BLS occupational data accurately captures the distribution of economically valuable work activities across top GDP sectors
- domain assumption: Tasks written by professionals with an average of 14 years' experience constitute representative samples of real work
Forward citations
Cited by 18 Pith papers
- neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
  neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
- OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
  OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
- SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
  SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
- FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
  FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
- Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact
  IIQ is a new 0-1000 normalized index that measures organizational AI impact via a novelty-weighted, time-decayed token stock plus usage frequency, leverage, complexity, and autonomy factors.
- Reward Hacking in Rubric-Based Reinforcement Learning
  Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...
- ClawGym: A Scalable Framework for Building Effective Claw Agents
  ClawGym supplies a 13.5K-task synthetic dataset, SFT-plus-RL trained agents, and a 200-instance benchmark to support the full lifecycle of Claw-style personal agent development.
- MarketBench: Evaluating AI Agents as Market Participants
  LLMs show poor calibration in predicting task success and token use on software engineering benchmarks, causing market auctions to underperform compared to perfect information scenarios, with limited improvement from ...
- LLMs Corrupt Your Documents When You Delegate
  LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
  BankerToolBench is a new open benchmark of end-to-end investment banking workflows developed with 502 bankers; even the best tested model (GPT-5.4) fails nearly half the expert rubric criteria and produces zero client...
- An Independent Safety Evaluation of Kimi K2.5
  Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
- The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems
  The 2025 AI Agent Index catalogs technical and safety details for 30 deployed AI agents and finds low developer transparency on safety, evaluations, and societal impacts.
- EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
  EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
- Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy
  AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.
- ClawGym: A Scalable Framework for Building Effective Claw Agents
  ClawGym is a framework for synthesizing 13.5K training tasks, training Claw-style agents via supervised fine-tuning and reinforcement learning, and evaluating them on a 200-instance benchmark.
- COMPOSITE-Stem
  COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.
- GLM-5: from Vibe Coding to Agentic Engineering
  GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
- Kimi K2.5: Visual Agentic Intelligence
  Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency by up to 4.5x.