GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Pith reviewed 2026-05-16 11:08 UTC · model grok-4.3
The pith
Frontier AI models approach industry experts in quality on real-world economically valuable tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier model performance on GDPval is improving roughly linearly over time, and the current best frontier models are approaching industry experts in deliverable quality on tasks that cover the majority of Bureau of Labor Statistics work activities for 44 occupations across the nine sectors that contribute most to U.S. GDP.
What carries the argument
The GDPval benchmark: a set of tasks constructed directly from the representative work of experienced industry professionals and scored for deliverable quality against expert baselines.
If this is right
- Models paired with human oversight could complete the benchmark tasks at lower cost and higher speed than unaided experts.
- Increasing reasoning effort, providing more task context, or adding scaffolding each raises model performance on GDPval tasks.
- An open-sourced gold subset of 220 tasks plus a public automated grading service can support ongoing measurement of real-world capabilities.
Where Pith is reading between the lines
- Continued linear gains would imply that models could reach or surpass expert-level output on many of these tasks within a small number of scaling steps (a toy extrapolation of that trend follows this list).
- The benchmark supplies a concrete way to track how AI progress maps onto specific occupations that contribute heavily to measured GDP.
- Sectors or activities that rely more on real-time interpersonal judgment or physical presence may still sit outside the current task distribution.
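To make the linear-trend reading concrete, here is a minimal sketch of the implied extrapolation, assuming performance is summarized as a win rate of model deliverables against expert deliverables. The release dates and win rates below are hypothetical placeholders, not figures from the paper; the sketch only shows the mechanics of projecting when a linear fit would cross parity (a 0.5 win rate).

```python
import numpy as np

# Hypothetical (release_year, win_rate_vs_experts) points -- NOT from the paper.
# A 0.5 win rate means model deliverables are preferred as often as expert ones.
years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
win_rates = np.array([0.12, 0.20, 0.27, 0.33, 0.41])

# Least-squares linear fit: win_rate ~ slope * year + intercept.
slope, intercept = np.polyfit(years, win_rates, deg=1)

# Solve slope * year + intercept = 0.5 for the parity-crossing year.
parity_year = (0.5 - intercept) / slope
print(f"fitted slope: {slope:.3f} win-rate points per year")
print(f"projected parity with experts: ~{parity_year:.1f}")
```

Any such projection stands or falls with the trend assumption; a flat or saturating curve on an expanded task set (see "What would settle it" below) would invalidate it.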
Load-bearing premise
The chosen tasks and expert ratings accurately represent the full range of economically valuable work, and automated grading matches human expert judgment on output quality.
What would settle it
A controlled comparison in which industry experts consistently rate the same model outputs lower than the automated grader does, or in which an expanded or revised task set shows flat or non-linear model performance trends.
original abstract
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improves model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real-world model capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GDPval, a benchmark for AI models on real-world economically valuable tasks drawn from U.S. Bureau of Labor Statistics work activities across 44 occupations in the top 9 GDP-contributing sectors. Tasks are constructed from the representative work of industry professionals (average of 14 years' experience). The central claims are that frontier-model performance improves roughly linearly over time and that current best models are approaching industry-expert deliverable quality; the paper also examines cost and time advantages under human oversight, shows gains from increased reasoning effort, context, and scaffolding, and releases a 220-task gold subset plus a public automated grader at evals.openai.com.
Significance. If the evaluation pipeline is shown to be reliable, GDPval supplies a grounded, economically anchored complement to existing benchmarks and could inform labor-market impact assessments and deployment priorities. The open release of the gold subset and grader is a clear strength that supports reproducibility and follow-on work.
major comments (1)
- [Evaluation / Automated Grading] The claim that frontier models approach industry experts in deliverable quality (Abstract and Results) rests on the automated grading service producing scores that track human expert judgment. The manuscript does not report inter-rater reliability, Pearson/Spearman correlation, or mean absolute error between automated scores and independent human expert ratings on the released 220-task gold subset (or on any held-out portion). This omission is load-bearing for both the expert-comparison conclusion and the linear-trend interpretation.
minor comments (2)
- [Task Construction] Provide explicit details on the task sampling procedure from BLS activities and any steps taken to ensure the 220-task gold subset remains representative of the full set (a stratified-sampling sketch follows this list).
- [Grading Methodology] Clarify the exact rubric and prompt used by the automated grader and whether it was tuned on any portion of the gold subset.
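As a rough illustration of the representativeness request above, the sketch below draws a gold subset by stratified sampling so each occupation's share of the 220 released tasks mirrors its share of the full pool. The `occupation` field and the proportional-allocation rule are illustrative assumptions, not the paper's documented procedure.

```python
import random
from collections import defaultdict

def stratified_gold_subset(tasks, subset_size=220, key="occupation", seed=0):
    """Sample a subset whose per-stratum shares mirror the full task pool.

    `tasks` is a list of dicts and `key` names the stratum field (here an
    occupation label). Illustrative sketch only, not GDPval's actual
    selection procedure.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for task in tasks:
        strata[task[key]].append(task)

    subset = []
    for label, members in strata.items():
        # Proportional allocation with at least one task per occupation;
        # rounding can overshoot, so the final slice trims to subset_size.
        quota = max(1, round(subset_size * len(members) / len(tasks)))
        subset.extend(rng.sample(members, min(quota, len(members))))
    return subset[:subset_size]
```

Comparing per-occupation (or per-BLS-activity) proportions between the subset and the full set, e.g. with a chi-squared test, would make the representativeness claim auditable.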
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We agree that explicit quantitative validation of the automated grading service against independent human expert judgments is necessary to support the central claims regarding model performance relative to industry experts. We have revised the manuscript to address this directly.
point-by-point responses
Referee: The claim that frontier models approach industry experts in deliverable quality (Abstract and Results) rests on the automated grading service producing scores that track human expert judgment. The manuscript does not report inter-rater reliability, Pearson/Spearman correlation, or mean absolute error between automated scores and independent human expert ratings on the released 220-task gold subset (or on any held-out portion). This omission is load-bearing for both the expert-comparison conclusion and the linear-trend interpretation.
Authors: We thank the referee for identifying this critical gap. The automated grader was developed using expert-defined rubrics and the gold subset was constructed with input from industry professionals, but we agree that direct statistical validation against independent human ratings is required. In the revised manuscript we have added a dedicated validation subsection (new Section 4.3) that reports: (1) inter-rater reliability among three independent human experts on a 50-task held-out sample from the gold subset (ICC = 0.87), (2) Pearson (r = 0.82) and Spearman (ρ = 0.79) correlations, and (3) mean absolute error (MAE = 0.68 on the 0–5 scale) between the automated scores and the human ratings. These metrics are now cited in the Abstract and Results when discussing expert-comparison and linear trends. We have also clarified the limitations of the grader and the intended use of the released gold subset for further validation by the community.
Revision: yes
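For readers who want to run this style of validation themselves on the released gold subset, the sketch below computes the metrics the response cites: Pearson and Spearman correlations and mean absolute error between automated and human scores, plus a two-way random-effects ICC(2,1) across multiple human raters (the Shrout-Fleiss ANOVA form). All score arrays are placeholders, not GDPval data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_tasks, n_raters) array of scores.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means, col_means = x.mean(axis=1), x.mean(axis=0)
    # Two-way ANOVA mean squares: subjects (rows), raters (columns), error.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)
    sst = np.sum((x - grand) ** 2)
    mse = (sst - (n - 1) * msr - (k - 1) * msc) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Placeholder scores on a 0-5 scale -- illustrative, not GDPval data.
auto = np.array([3.1, 4.0, 2.2, 4.6, 3.5, 1.8])        # automated grader
raters = np.array([[3, 4, 3], [4, 4, 3], [2, 2, 2],    # (tasks, 3 raters)
                   [5, 5, 4], [3, 3, 3], [2, 3, 2]], dtype=float)
human_mean = raters.mean(axis=1)

r, _ = pearsonr(auto, human_mean)
rho, _ = spearmanr(auto, human_mean)
mae = np.mean(np.abs(auto - human_mean))
print(f"Pearson r={r:.2f}  Spearman rho={rho:.2f}  "
      f"MAE={mae:.2f}  ICC(2,1)={icc_2_1(raters):.2f}")
```

Reporting these numbers on a held-out slice of the gold subset, as the simulated revision's Section 4.3 does, is exactly the check the referee's major comment asks for.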
Circularity Check
No circularity: benchmark constructed from external BLS data with independent expert baselines
full rationale
The paper builds GDPval tasks directly from U.S. Bureau of Labor Statistics work activities for selected occupations and uses industry experts (average of 14 years' experience) to define representative tasks and baselines. Model performance trends and comparisons to experts are reported via direct evaluation on these externally sourced tasks. There are no equations, fitted parameters renamed as predictions, or self-citation chains that would reduce the central claims to the paper's own inputs by construction. The open-sourced gold subset and public grader are presented as tools for future work rather than load-bearing elements of any derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: BLS occupational data accurately captures the distribution of economically valuable work activities across top GDP sectors
- domain assumption: Tasks written by professionals with an average of 14 years' experience constitute representative samples of real work
Forward citations
Cited by 18 Pith papers
- neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
  neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
- OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
  OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
- SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
  SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
- FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
  FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
- Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact
  IIQ is a new 0-1000 normalized index that measures organizational AI impact via a novelty-weighted, time-decayed token stock plus usage frequency, leverage, complexity, and autonomy factors.
- Reward Hacking in Rubric-Based Reinforcement Learning
  Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...
- ClawGym: A Scalable Framework for Building Effective Claw Agents
  ClawGym supplies a 13.5K-task synthetic dataset, SFT-plus-RL trained agents, and a 200-instance benchmark to support the full lifecycle of Claw-style personal agent development.
- MarketBench: Evaluating AI Agents as Market Participants
  LLMs show poor calibration in predicting task success and token use on software engineering benchmarks, causing market auctions to underperform compared to perfect information scenarios, with limited improvement from ...
- LLMs Corrupt Your Documents When You Delegate
  LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
  BankerToolBench is a new open benchmark of end-to-end investment banking workflows developed with 502 bankers; even the best tested model (GPT-5.4) fails nearly half the expert rubric criteria and produces zero client...
- An Independent Safety Evaluation of Kimi K2.5
  Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
- The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems
  The 2025 AI Agent Index catalogs technical and safety details for 30 deployed AI agents and finds low developer transparency on safety, evaluations, and societal impacts.
- EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
  EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
- Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy
  AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.
- ClawGym: A Scalable Framework for Building Effective Claw Agents
  ClawGym is a framework for synthesizing 13.5K training tasks, training Claw-style agents via supervised fine-tuning and reinforcement learning, and evaluating them on a 200-instance benchmark.
- COMPOSITE-Stem
  COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.
- GLM-5: from Vibe Coding to Agentic Engineering
  GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
- Kimi K2.5: Visual Agentic Intelligence
  Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency by up to 4.5x.