BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

Alex Wang; Chase Blagden; Eric Xu; Georg Meinhardt; Jacob Katz; Joseph H. Kim; Pratyush K. Chaudhary

arxiv: 2606.03829 · v1 · pith:FARGNDAPnew · submitted 2026-06-02 · 💻 cs.AI

Pith reviewed 2026-06-28 09:33 UTC · model grok-4.3

classification 💻 cs.AI

keywords: financial research benchmark · AI agent evaluation · derivation quality · rubric scoring · workflow evaluation · auditable reasoning · LLM benchmarking · partial credit assessment

The pith

A new benchmark measures AI financial agents on complete auditable derivations using expert rubrics rather than final answers alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Financial-research answers matter only when their full production process can be audited, including source selection, definitions, assumptions, and calculations. Existing benchmarks focus on isolated skills or end results and therefore miss this requirement. BigFinanceBench supplies 928 expert-authored tasks, each paired with a ground-truth answer and a point-weighted rubric that breaks the derivation into independently verifiable steps. When ten frontier and open-weight agents are tested, the strongest reaches 58.8 percent on the rubrics. Final-answer accuracy turns out to be a useful yet lossy signal, and performance gaps appear unevenly across different financial workflows.

Core claim

BigFinanceBench is a workflow-grounded benchmark of 928 open-ended financial-research tasks in which each item supplies both a reference answer and a point-weighted rubric that decomposes the required derivation into checkable steps. Across 36,241 rubric points the benchmark therefore supports partial-credit scoring and failure localization along the analyst workflow. Evaluation of ten current agents shows the best system attaining only 58.8 percent rubric score, demonstrates that final-answer accuracy is an incomplete proxy for derivation quality, and reveals non-uniform capability variation across financial workflows.

What carries the argument

Point-weighted rubrics that decompose each financial-research derivation into independently checkable steps and thereby enable partial-credit evaluation of the full workflow.
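A minimal sketch of point-weighted partial-credit scoring, assuming each rubric line receives a binary met or not-met verdict from a judge. The `RubricLine` type, the step descriptions, and the point values are hypothetical illustrations rather than items from the benchmark, and the paper's exact aggregation across questions may differ.

```python
from dataclasses import dataclass


@dataclass
class RubricLine:
    description: str  # what the derivation step checks (hypothetical wording)
    points: int       # weight assigned by the rubric author
    satisfied: bool   # judge verdict for one agent output (assumed binary)


def rubric_score(lines):
    """Return the fraction of available rubric points earned, from 0.0 to 1.0."""
    total = sum(line.points for line in lines)
    if total == 0:
        raise ValueError("rubric has no points")
    earned = sum(line.points for line in lines if line.satisfied)
    return earned / total


# Hypothetical rubric for a single question.
example = [
    RubricLine("Selects the correct filing", 4, True),
    RubricLine("Uses the correct fiscal period", 3, True),
    RubricLine("Applies the stated accounting definition", 3, False),
    RubricLine("Computes the final figure correctly", 2, False),
]

score = rubric_score(example)
# Failure localization: the first unmet step in workflow order.
first_failure = next((l.description for l in example if not l.satisfied), None)
print(f"rubric score: {score:.3f}; first failed step: {first_failure}")
```

In this toy case the agent earns 7 of 12 points even though its final figure is wrong, which is the kind of partial credit a final-answer check cannot express.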

If this is right

Final-answer accuracy alone is an incomplete measure of agent performance on financial tasks.
Agent capability is not uniform; some workflows expose larger gaps than others.
Substantial headroom remains for agents that can produce auditable derivations.
The benchmark permits localization of specific failure points within the research workflow.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Rubric-based workflow evaluation could be adapted to other domains that require traceable reasoning, such as legal analysis or scientific literature review.
Training regimes focused on step-by-step derivation rather than end answers might close the observed gaps more effectively than scaling alone.
Extending the benchmark to include live data feeds or multi-document synthesis would test whether current headroom persists under more realistic conditions.

Load-bearing premise

The expert-authored rubrics supply a reliable, comprehensive, and unbiased breakdown of what constitutes high-quality financial-research derivation.

What would settle it

A study in which independent human experts score the same agent outputs once with the provided rubrics and once with holistic judgment, then find low agreement between the two methods, would show the rubrics do not capture derivation quality.
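One way such an agreement check could be quantified is a rank correlation between rubric scores and holistic ratings of the same outputs. The sketch below implements Spearman's rho with average ranks for ties; the choice of statistic, the function names, and the sample numbers are assumptions for illustration and do not come from the paper.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five agent outputs.
rubric_scores = [0.20, 0.45, 0.60, 0.75, 0.90]
holistic_ratings = [2, 1, 3, 5, 4]  # e.g. an expert's 1-5 overall rating
rho = spearman(rubric_scores, holistic_ratings)
print(f"Spearman rho: {rho:.2f}")
```

A value near 1 would suggest the rubrics order outputs much as holistic judgment does; a value near 0 would point toward the failure case described above.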

Figures

Figures reproduced from arXiv: 2606.03829 by Alex Wang, Chase Blagden, Eric Xu, Georg Meinhardt, Jacob Katz, Joseph H. Kim, Pratyush K. Chaudhary.

**Figure 1.** Model performance on BigFinanceBench: point-weighted rubric score (left) and final-answer accuracy (right). Error bars show 95% bootstrap confidence intervals over questions; hatched bars are open-weight systems. *Equal contribution. †Work done while at Rogo.
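The Figure 1 caption mentions 95% bootstrap confidence intervals over questions. A minimal percentile-bootstrap sketch follows, assuming one score per question and resampling questions with replacement; the function name, the resample count, and the sample scores are hypothetical, and the paper may use a different bootstrap variant.

```python
import random


def bootstrap_ci(per_question_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean score over questions."""
    if not per_question_scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(per_question_scores)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and record the resampled mean.
        sample = [per_question_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-question rubric scores for one model.
scores = [0.2, 0.5, 0.9, 0.4, 0.7, 0.6, 0.3, 0.8]
low, high = bootstrap_ci(scores)
print(f"mean = {sum(scores) / len(scores):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling at the question level, rather than at the rubric-line level, treats the question as the unit of variation, which matches the caption's wording.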

**Figure 2.** Dataset construction workflow and resulting-corpus statistics for B

**Figure 3.** Workflow composition of BigFinanceBench. (a) Joint distribution of analyst workflow type against the analytical skill each question requires. (b) Marginal distribution over workflow types. See Appendix G for additional sector coverage and question difficulty insights.

4 Benchmark Evaluation, 4.1 Setup. Agent harness. We evaluate current frontier and open-weight models in a common open-book ReAct-style harnes…

**Figure 4.** Performance versus cost frontier on BigFinanceBench. Each point plots a model’s score against average inference cost per question, computed from trace token counts and provider list prices. Lines mark the metric-specific Pareto frontiers
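The Figure 4 caption describes cost computed from token counts and list prices, plus a Pareto frontier over score and cost. The sketch below shows one plausible way to compute both; the function names, the per-million-token pricing convention, and the four models with their numbers are hypothetical and are not taken from the paper.

```python
def cost_per_question(input_tokens, output_tokens,
                      price_in_per_mtok, price_out_per_mtok):
    """Dollar cost of one rollout, assuming prices quoted per million tokens."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


def pareto_frontier(points):
    """Return names of models that no other model beats on both cost and score.

    points: list of (name, cost, score) tuples. Lower cost and higher score
    are better. Exact duplicates of a frontier point are dropped.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Hypothetical models: (name, average cost per question in USD, rubric score).
models = [
    ("A", 0.10, 0.30),
    ("B", 0.50, 0.55),
    ("C", 0.60, 0.50),  # costs more than B and scores lower, so dominated
    ("D", 1.20, 0.58),
]
print(pareto_frontier(models))
print(cost_per_question(200_000, 10_000, 3.0, 15.0))
```

Running the frontier once per metric (rubric score, then final-answer accuracy) would give the metric-specific frontiers the caption refers to.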

**Figure 5.** Per-model rubric score against final-answer accuracy. Every system sits above the

**Figure 6.** The headline tie among the top-3 closed models is a partition: each leads on different

**Figure 7.** Workflow-stage accuracy and clean-setup calculation. Panel (a) maps each rubric line to

**Figure 7.** The graph over models and questions is connected, so the model effects are comparable

**Figure 8.** Figure 8 gives the corresponding appendix marginals: sector coverage is broad but concentrated in TMT and diversified items, while most questions specify precision and many remain single-period lookups despite having compact final answers.

**Figure 9.** Rubric-criterion stage mix by workflow classified as

**Figure 10.** ReAct rollout stop-reason distribution per model.

**Figure 11.** Per-model Cohen’s κ between the two judges (Gemini 3.1 Pro Preview and Claude Opus 4.7) on final-answer correctness, paired by (question, trial). All values exceed the 0.81 “almost perfect” threshold of Landis and Koch [1977].

Within-question trial reliability. Inter-judge κ controls for judge noise on a single trial; the complementary axis is rollout noise across the n = 3 trials per question for a fixed…
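For readers unfamiliar with the inter-judge statistic reported in Figure 11, Cohen's κ compares observed agreement with the agreement expected by chance from each judge's label frequencies. A minimal sketch follows, assuming verdicts paired by (question, trial); the function name and the verdict lists are hypothetical and do not reproduce the paper's data.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: share of items where both judges give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each judge's marginal rate, per category.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical final-answer verdicts (1 = correct, 0 = incorrect).
judge_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_b = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
kappa = cohens_kappa(judge_a, judge_b)
print(f"kappa = {kappa:.2f}")
```

Here the judges agree on 9 of 10 items, chance agreement is 0.5, and κ comes out at 0.8, which illustrates why raw percentage agreement overstates reliability when labels are imbalanced.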

**Figure 12.** Per-question rubric-score standard deviation across the

**Figure 13.** Per-model rubric score conditioned on the source-document type the question requires.

**Figure 14.** Per-(question, model) score, rubric (left) vs final-answer (right). Each row is one question;

**Figure 15.** Routing ladder: achievable rubric (teal) and final-answer accuracy (orange) at increasing

**Figure 16.** Per-workflow rubric (left) and final-answer accuracy (right) of the top-3 closed models in

Original abstract

Financial-research answers are decision-relevant only when another analyst can audit how they were produced: which source was chosen, which period and accounting definition were used, which assumptions were made, and how the calculation was performed. Existing finance benchmarks largely evaluate isolated subskills or final answers, leaving the auditable derivation itself under-measured. We introduce BigFinanceBench, a 928-item expert-authored benchmark of open-ended financial-research tasks in which each item pairs a ground-truth reference answer with a point-weighted rubric that decomposes the derivation into independently checkable steps. BigFinanceBench is workflow-grounded in that it evaluates the full derivation rather than only the final output. Across 36,241 rubric points, the benchmark supports partial-credit evaluation and localization of failures across the analyst workflow. Evaluating ten current frontier and open-weight agents, we find substantial headroom: the best system reaches only 58.8% rubric score, final-answer accuracy is a useful but lossy proxy for derivation quality, and model capability varies non-uniformly across financial workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BigFinanceBench adds a derivation-scoring benchmark with real agent results, but the rubrics lack any reported reliability checks.

The main point is that this paper introduces BigFinanceBench, a 928-task set with expert point-weighted rubrics that score full financial derivations rather than just final answers, and it reports that the best of ten tested agents reaches only 58.8% on the rubrics.

It does a solid job filling the gap the abstract describes. Existing finance benchmarks often stop at isolated skills or end results, while this one decomposes workflows into checkable steps across 36,241 rubric points. The findings that final-answer accuracy misses derivation quality and that performance varies across workflows are concrete and useful for anyone building agents for auditable tasks.

The soft spot is exactly the one the stress-test flags. The paper describes expert-authored rubrics but supplies no inter-rater agreement numbers, no correlation with independent overall quality judgments, and no checks for ambiguous or overlapping steps. Without that evidence the absolute scores and the agent comparisons rest on untested assumptions about how well the rubrics capture derivation quality.

This is for people who evaluate or improve AI on financial research workflows. A reader focused on benchmark design or regulated-domain agents would get value from the workflow grounding and the scale of the evaluations.

It deserves peer review because the core design is new and the agent results are substantive, though any review should press for rubric validation data.

Referee Report

1 major / 1 minor

Summary. The paper introduces BigFinanceBench, a 928-item benchmark of open-ended financial-research tasks. Each item includes a ground-truth reference answer paired with an expert-authored, point-weighted rubric that decomposes the derivation into independently checkable steps (totaling 36,241 rubric points). The benchmark is positioned as workflow-grounded, enabling partial-credit evaluation of full derivations rather than isolated subskills or final answers alone. Evaluation of ten frontier and open-weight agents shows the best system achieves only 58.8% rubric score, that final-answer accuracy is a lossy proxy for derivation quality, and that model capability varies non-uniformly across financial workflows.

Significance. If the rubric decompositions prove reliable and comprehensive, the benchmark would address a genuine gap in existing finance evaluations by measuring auditable derivation quality at scale. The reported headroom (58.8% ceiling) and the distinction between final-answer accuracy and rubric score would supply concrete, falsifiable targets for agent development in a high-stakes domain.

major comments (1)

[Abstract and methods description of rubric construction] The central claims (58.8% best score, final-answer accuracy as lossy proxy, non-uniform workflow variation) rest on the assumption that the 36,241 point-weighted rubric items constitute an objective, reliable decomposition of derivation quality. The manuscript describes expert authorship and point weighting but reports no inter-rater agreement statistics, no external correlation with independent expert quality judgments, and no sensitivity analysis to rubric authoring choices. Without such validation, both absolute scores and comparative agent rankings remain sensitive to the specific rubric construction rather than to agent behavior alone.

minor comments (1)

[Abstract] The abstract states benchmark size, rubric count, and agent scores but supplies no methods details on how the 58.8% figure or rubric reliability was established; this should be expanded in the main text even if full methods appear later.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for rubric validation. This is a substantive point that strengthens the benchmark's credibility. We address it directly below and commit to revisions that add empirical support for rubric reliability.

Referee: [Abstract and methods description of rubric construction] The central claims (58.8% best score, final-answer accuracy as lossy proxy, non-uniform workflow variation) rest on the assumption that the 36,241 point-weighted rubric items constitute an objective, reliable decomposition of derivation quality. The manuscript describes expert authorship and point weighting but reports no inter-rater agreement statistics, no external correlation with independent expert quality judgments, and no sensitivity analysis to rubric authoring choices. Without such validation, both absolute scores and comparative agent rankings remain sensitive to the specific rubric construction rather than to agent behavior alone.

Authors: We agree that inter-rater agreement, external correlation, and sensitivity analysis are important for establishing that rubric scores reflect agent behavior rather than authoring artifacts. The current manuscript relies on expert authorship by domain specialists and workflow-grounded decomposition but does not quantify reliability. In the revised manuscript we will add: (1) inter-rater agreement on a stratified sample of 100 tasks scored independently by a second financial expert, reporting Cohen's kappa and percentage agreement; (2) correlation between rubric scores and an independent expert's holistic quality rating on the same sample; and (3) a brief sensitivity discussion noting that rubric points were derived from standard financial-analysis workflows (e.g., DCF, ratio analysis) with explicit point allocation rules. These additions will directly support the reported 58.8% ceiling and the claim that final-answer accuracy is lossy. We view this as a necessary major revision.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark and evaluations are self-contained

The paper introduces a new expert-authored benchmark (BigFinanceBench) consisting of 928 tasks with point-weighted rubrics and reports direct evaluations of ten agents on those rubrics. No equations, fitted parameters, or predictions are defined in terms of the target results; the central claims (58.8% best score, final-answer accuracy as lossy proxy, non-uniform variation) follow from straightforward scoring of agent outputs against the provided rubrics. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the benchmark construction or results. The work is externally falsifiable via the released benchmark items and does not reduce any reported quantity to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; the central claim rests on the assumption that expert rubrics validly measure derivation quality, with no free parameters or invented entities described.

axioms (1)

domain assumption: Expert-authored rubrics can decompose financial research derivations into independently checkable steps with reliable point weights.
This premise underpins the partial-credit evaluation and localization of failures across the 36,241 rubric points.

pith-pipeline@v0.9.1-grok · 5730 in / 1370 out tokens · 34760 ms · 2026-06-28T09:33:47.614057+00:00 · methodology

[1] [1]

FinTradeBench: A Financial Reasoning Benchmark for LLMs

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, and Aritra Dutta. Fin- TradeBench: A financial reasoning benchmark for LLMs.arXiv preprint arXiv:2603.19225,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

PRBench: Large-scale expert rubrics for evaluating high-stakes professional reasoning.arXiv preprint arXiv:2511.11562,

Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, et al. PRBench: Large-scale expert rubrics for evaluating high-stakes professional reasoning.arXiv preprint arXiv:2511.11562,

work page arXiv

[3] [3]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Health- Bench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Finance Agent Benchmark: Benchmarking LLMs on real-world financial research tasks.arXiv preprint arXiv:2508.00828,

Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu. Finance Agent Benchmark: Benchmarking LLMs on real-world financial research tasks.arXiv preprint arXiv:2508.00828,

work page arXiv

[5] [5]

FinQA: A dataset of numerical reasoning over financial data

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. FinQA: A dataset of numerical reasoning over financial data. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711,

2021

[6] [6]

FinDER: Financial dataset for question answering and evaluating retrieval-augmented generation

Chanyeol Choi, Jihoon Kwon, Jaeseon Ha, Hojun Choi, Chaewoon Kim, Yongjae Lee, Jy-yong Sohn, and Alejandro Lopez-Lira. FinDER: Financial dataset for question answering and evaluating retrieval-augmented generation. InProceedings of the 6th ACM International Conference on AI in Finance, pages 638–646, 2025a. Chanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira...

2025

[7] [7]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009

[9] [9]

Preprint, arXiv:2508.12790

Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors.arXiv preprint arXiv:2508.12790,

work page arXiv

[10] [10]

FinanceBench: A New Benchmark for Financial Question Answering

10 Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. Fi- nanceBench: A new benchmark for financial question answering.arXiv preprint arXiv:2311.11944,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

FinRetrieval: A benchmark for financial data retrieval by AI agents

Eric Y Kim and Jie Huang. FinRetrieval: A benchmark for financial data retrieval by AI agents. arXiv preprint arXiv:2603.04403,

work page arXiv

[12] [12]

ToolSandbox: A stateful, conversational, interactive evalua- tion benchmark for LLM tool use capabilities

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. ToolSandbox: A stateful, conversational, interactive evalua- tion benchmark for LLM tool use capabilities. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 1160–1183,

2025

[13] Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J Pal, and Siva Reddy. AgentRewardBench: Evaluating automatic evaluations of web agent trajectories. arXiv preprint arXiv:2504.08942.

[14] Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, et al. Remote labor index: Measuring AI automation of remote work. arXiv preprint arXiv:2510.26787.

[15] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering? arXiv preprint arXiv:2502.12115.

[16] Vaskar Nath, Pranav Raja, Claire Yoon, and Sean Hendryx. ToolComp: A multi-tool reasoning & process supervision benchmark. arXiv preprint arXiv:2501.01290.

[17] Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, et al. GDPval: Evaluating AI model performance on real-world economically valuable tasks. arXiv preprint arXiv:2510.04374.

[18] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.

[19] Bertie Vidgen, Abby Fennelly, Evan Pinnix, Julien Benchek, Daniyal Khan, Zach Richards, Austin Bridges, Calix Huang, Kanishka Sahu, Abhishek Kottamasu, et al. The AI productivity index (APEX). arXiv preprint arXiv:2509.25721.

[20] Mingzhu Wang, Yuzhe Zhang, Qihang Zhao, Junyi Yang, and Hong Zhang. Redefining information retrieval of structured database via large language models. arXiv preprint arXiv:2405.05508.

[21] Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, and Yi Dong. ProfBench: Multi-domain rubrics requiring professional knowledge to answer and judge. arXiv preprint arXiv:2510.18941.

[22] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516.

[23] Siqiao Xue, Xiaojing Li, Fan Zhou, Qingyang Dai, Zhixuan Chu, and Hongyuan Mei. FAMMA: A benchmark for financial domain multilingual multimodal question answering. arXiv preprint arXiv:2410.04526.

[24]

The run is from Claude Opus 4.7 and illustrates a successful EDGAR path: the agent finds the Q3 2025 8-K, follows the SEC filing directory to the press-release exhibit, extracts the ARR and NDRR inputs, and computes the projection. The harness recorded visible assistant text and tool calls, but no private chain of thought (thinking=off). Long URLs and que...

[25]

Correctly identifies

The graph over models and questions is connected, so the model effects are comparable on one scale. The fitted design uses 2,909 RD-clean observations. The relevant diagnostic is not overall fit quality but the relative scale of question and model variation after setup is clean. The centered question effects have standard deviation 22.1 percentage points a...

[26]

Other" includes $19.8m of restructuring expenses [+1] Identifies that

Question Reference answer Full rubric If I take Dayforce’s management adjusted reported EBIT as is, would it be overstated or understated or the same last year if I think capitalized software expense is a real cost? If so, by how much? Overstated by $90.1m of excluded capitalized software development costs. Adj EBIT was unburdened by any amortization of c...

[27]

Please round the final answer to the nearest tenth of a billion

Question Reference answer Full rubric What would be the payouts to Spotify’s executive officers if Spotify received a buyout offer at a $800 per share price? Please use the latest disclosed incentive program payments (including options and RSUs), shareholdings, and change of control severance payments in addition to any other compensation schemes as of 9/...

[28]

The model submitted 1.74% YoY growth, a $9.18M ARR increase, and $536.38M implied Q3’26 ARR; both judges marked the final answer correct and awarded 32/32 rubric points (9/9 lines). Step Visible agent text and tool call Key returned content Reward signal 0a “I’ll research Udemy’s Q3 2025 financials to find the relevant metrics.” edgar_search(ticker=UDMY, for...