IPO Finance Agent: Benchmark of LLM Financial Analysts Beyond Finance Agent v2, with Automated Rubric Generation, on the SpaceX (SPCX) IPO

Mostapha Benhenda

arxiv: 2606.23032 · v3 · pith:IMJ5N5QFnew · submitted 2026-06-22 · 💻 cs.AI · q-fin.GN

IPO Finance Agent: Benchmark of LLM Financial Analysts Beyond Finance Agent v2, with Automated Rubric Generation, on the SpaceX (SPCX) IPO

Mostapha Benhenda This is my paper

Pith reviewed 2026-07-01 06:46 UTC · model grok-4.3

classification 💻 cs.AI q-fin.GN

keywords IPO due diligenceLLM financial analysisSpaceX S-1 filingautomated rubric generationcontextual retrievalfinancial agent benchmarkevaluator-optimizer pipeline

0 comments

The pith

Zhipu GLM-5.2 reaches 79.8% accuracy on IPO diligence for SpaceX S-1 filings, exceeding the Finance Agent v2 ceiling of 57.9%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends Finance Agent v2 from periodic SEC filings to the distinct challenges of IPO due diligence on S-1 documents, which contain longer narratives on governance, pro forma accounting, and risk disclosures. It replaces naive chunk retrieval with contextual retrieval to handle document length and adds an evaluator-optimizer pipeline that extracts candidate facts, drafts rubrics, audits them for omissions and hallucinations, and iterates repairs until human review. A dataset of 1,000 questions is created, with 70 public questions released on the SpaceX S-1 filing. On this benchmark the leading model scores 79.8% while a cost-efficient alternative reaches 77.2% at 0.05 USD per query, both above prior leaderboard entries.

Core claim

By improving the agentic harness with contextual retrieval and introducing an automated evaluator-optimizer pipeline for rubric generation, the IPO Finance Agent benchmark evaluates LLMs on 70 SpaceX S-1 diligence questions; Zhipu GLM-5.2 achieves 79.8% accuracy and Xiaomi MiMo-2.5 Pro reaches 77.2% at 0.05 USD per query, both surpassing the Finance Agent v2 ceiling of 57.9% at higher cost.

What carries the argument

The evaluator-optimizer pipeline that extracts facts from model answers, consolidates them into draft criteria, audits for omissions, hallucinations, and redundancy, then iterates with LLM feedback to produce final rubrics.

If this is right

Contextual retrieval enables agentic systems to process lengthy S-1 filings where the prior harness produced no output.
Automated rubric generation allows scalable evaluation of financial tasks with reduced manual effort.
Models can now be ranked on a Pareto frontier of accuracy versus cost for IPO-specific diligence.
The public 70-question set supports reproducible testing while the private remainder reduces contamination risk.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rubric pipeline and retrieval upgrades could be applied to due diligence on other private companies preparing S-1 filings.
Cost-efficient models at 0.05 USD per query open the possibility of routine automated screening for smaller investors or advisors.
If the benchmark generalizes, it would shift evaluation focus from periodic reports to capital-formation and governance analysis.

Load-bearing premise

The 70 public SpaceX questions plus the automated rubric pipeline produce a faithful and unbiased measure of model capability on IPO due diligence.

What would settle it

A human review finding that the generated rubrics systematically omit key S-1 elements such as underwriting risk disclosures or capital-formation details would show the accuracy scores do not reflect true diligence performance.

Figures

Figures reproduced from arXiv: 2606.23032 by Mostapha Benhenda.

read the original abstract

Finance Agent v2 (by Vals AI) has emerged as the reference benchmark for evaluating both Anthropic Claude and OpenAI ChatGPT frontier language models on financial tasks. However, it narrowly deals with periodic reporting from publicly traded companies (SEC 10-K and 10-Q filings), and its agentic harness relies on naive, unenriched chunk retrieval. Neither the task design nor the retrieval approach addresses the distinct challenges of IPO due diligence. SEC S-1 filings combine historical financial statements, governance structures, pro forma and common-control accounting treatments, capital-formation narratives, and underwriting-sensitive risk disclosures within substantially longer documents than typical periodic filings. That is why we introduce IPO Finance Agent, which extends the Finance Agent v2 framework along two directions: task domain and retrieval architecture. During our experiments, the original Finance Agent v2 harness basically failed to deliver any output related to the SpaceX S-1 filing, due to document length. We therefore had to improve the agentic harness with contextual retrieval, a more realistic and industry-standard approach for long documents. We also built a dataset of 1,000 IPO-diligence questions, and publicly release 70 questions on the SpaceX (SPCX) S-1 filing to support reproducibility, while the remainder are held private to guard against benchmark contamination. In addition, we introduce an evaluator-optimizer pipeline to automatically generate evaluation rubrics for the benchmark: candidate facts are extracted from model answers, consolidated into draft criteria, then automatically audited for omissions, hallucinations, mistiered items, and redundancy, with LLM feedback driving iterative repair, targeted enrichment, and deduplication. Human experts only review final rubrics before deployment. Results show that the best-performing evaluated model, Zhipu GLM-5.2, reaches 79.8% accuracy, and the most cost-efficient model on the resulting Pareto frontier, Xiaomi MiMo-2.5 Pro, reaches slightly lower accuracy (77.2%) at 0.05 USD per query, while exceeding the current Finance Agent v2 leaderboard ceiling, Google Gemini 3.5 Flash at 57.9% for 2.51 USD per query, and undercutting even FABv2's cheapest entry (MiniMax M3: 48.3% at 0.32 USD) on cost-efficiency. Code and data are released on GitHub https://github.com/benstaf/ipoagent

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a SpaceX S-1 benchmark plus contextual retrieval and an LLM rubric pipeline, but thin validation leaves the accuracy numbers hard to trust.

read the letter

The core contribution is a new evaluation set for IPO due diligence on the SpaceX S-1 filing. They took the Finance Agent v2 setup, which could not handle the longer document, replaced the chunk retrieval with contextual retrieval, generated 1000 questions (70 public), and added an evaluator-optimizer loop that extracts facts from answers, audits for omissions and hallucinations, and iterates before light human review.

The practical changes are clear: the new retrieval lets the agent produce outputs where the old harness failed, and the GitHub release of code and data supports checking the harness. The reported numbers show Zhipu GLM-5.2 at 79.8% and Xiaomi MiMo-2.5 Pro at 77.2% for low cost, above the prior 57.9% ceiling.

The soft spot is the rubric and question quality. The abstract describes the pipeline but supplies no inter-rater agreement, coverage metrics on governance or accounting items, or ablation of the automated steps. With a narrow public slice of 70 questions drawn from a private pool, it is not obvious that the scores reflect general IPO diligence skill rather than the specific rubric construction. The stress-test note flags exactly this gap, and nothing in the provided text closes it.

This is for groups already running financial-agent benchmarks who want an S-1 test case. It is worth sending to referees because the task extension and retrieval fix are concrete, but the review should focus on rubric validation and question representativeness before the accuracy claims can be taken as settled.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces IPO Finance Agent as an extension of Finance Agent v2 for evaluating LLMs on IPO due diligence, using the SpaceX S-1 filing. It adds contextual retrieval for long documents, a 1,000-question dataset (70 public on SpaceX), and an automated evaluator-optimizer pipeline for rubric generation via fact extraction, consolidation, LLM audit, and human review. Reported results show Zhipu GLM-5.2 at 79.8% accuracy and Xiaomi MiMo-2.5 Pro at 77.2% (0.05 USD/query), exceeding the Finance Agent v2 ceiling of 57.9%; code and data are released on GitHub.

Significance. If the benchmark holds, the work supplies a new evaluation domain for financial LLMs that targets S-1-specific elements (governance, pro-forma accounting, underwriting risks) absent from periodic-filing benchmarks. The public code/data release and automated rubric pipeline are concrete strengths that enable reproducibility and extension by others.

major comments (3)

[Rubric Generation] § on automated rubric generation (evaluator-optimizer pipeline): no inter-rater agreement, coverage statistics for key IPO topics (governance, pro-forma treatments, risk disclosures), or ablation results are reported. Without these, the headline accuracies (79.8%, 77.2%) cannot be shown to measure genuine diligence capability rather than pipeline artifacts.
[Dataset Construction] Dataset section: the 70 public SpaceX questions are sampled from a 1,000-question pool, yet no sampling protocol, representativeness metrics, or public-vs-private coverage comparison is supplied. This directly affects whether the cost-accuracy Pareto claims generalize beyond the released subset.
[Results] Results and comparison paragraphs: the claim that new results exceed the Finance Agent v2 ceiling of 57.9% treats the two benchmarks as comparable, but the manuscript itself notes that S-1 filings differ in length, accounting treatments, and risk focus from 10-K/10-Q tasks; no alignment or difficulty calibration is provided to support the cross-benchmark ranking.

minor comments (2)

[Agentic Harness] The description of the contextual retrieval harness would be clearer with a short pseudocode block or flowchart.
[Reproducibility] Ensure the GitHub repository link remains stable and includes the exact 70-question set and rubric examples used for the reported numbers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments identify clear opportunities to strengthen the presentation of the automated rubric pipeline, dataset details, and cross-benchmark comparisons. We address each point below and indicate the revisions we will make in the next version of the manuscript.

read point-by-point responses

Referee: [Rubric Generation] § on automated rubric generation (evaluator-optimizer pipeline): no inter-rater agreement, coverage statistics for key IPO topics (governance, pro-forma treatments, risk disclosures), or ablation results are reported. Without these, the headline accuracies (79.8%, 77.2%) cannot be shown to measure genuine diligence capability rather than pipeline artifacts.

Authors: We agree that quantitative validation of the rubric pipeline is necessary. The current pipeline already includes an LLM audit step for omissions, hallucinations, and redundancy followed by human expert review of final rubrics. In the revised manuscript we will add: (1) coverage statistics showing the distribution of rubric criteria across core IPO topics (governance, pro-forma accounting, risk disclosures, etc.); (2) ablation results comparing accuracy with and without the consolidation and audit stages; and (3) inter-rater agreement statistics from the human review phase (Cohen’s kappa or equivalent). These additions will be placed in a new subsection under the rubric-generation description and will directly address whether the reported accuracies reflect pipeline artifacts. revision: partial
Referee: [Dataset Construction] Dataset section: the 70 public SpaceX questions are sampled from a 1,000-question pool, yet no sampling protocol, representativeness metrics, or public-vs-private coverage comparison is supplied. This directly affects whether the cost-accuracy Pareto claims generalize beyond the released subset.

Authors: We accept this criticism. The 1,000-question pool was constructed by systematically extracting factual statements from each major section of the SpaceX S-1 and converting them into questions; the 70 public questions were chosen to span the same topic distribution while remaining self-contained. In revision we will: (a) explicitly describe the sampling protocol and stratification criteria; (b) report topic-level representativeness metrics (e.g., percentage of questions per S-1 section); and (c) provide a high-level comparison of topic coverage between the public 70 and the private remainder without disclosing private content. These details will be added to the Dataset Construction section. revision: yes
Referee: [Results] Results and comparison paragraphs: the claim that new results exceed the Finance Agent v2 ceiling of 57.9% treats the two benchmarks as comparable, but the manuscript itself notes that S-1 filings differ in length, accounting treatments, and risk focus from 10-K/10-Q tasks; no alignment or difficulty calibration is provided to support the cross-benchmark ranking.

Authors: We acknowledge the overstatement. The manuscript already notes the structural differences between S-1 and periodic filings, yet the comparison language implies direct equivalence. In the revised version we will: (1) remove or qualify the phrase “exceeding the Finance Agent v2 ceiling” and instead state that IPO Finance Agent achieves higher accuracy on a distinct, longer-document task with contextual retrieval; (2) add an explicit limitations paragraph clarifying the absence of difficulty calibration and the non-comparability of the two benchmarks; and (3) frame the cost-efficiency results strictly within the IPO setting. These changes will appear in the Results and Discussion sections. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark results independent of inputs

full rationale

The paper reports direct empirical accuracies (e.g., GLM-5.2 at 79.8%) obtained by running models on a fixed set of 70 public questions drawn from a SpaceX S-1 filing, scored via an automated rubric pipeline. No equations, fitted parameters, predictions, or derivations are present. The methodology describes question construction and rubric generation but does not reduce any reported result to its own inputs by construction. Comparisons to Finance Agent v2 are external benchmarks, not self-referential. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, invented entities, or non-standard axioms; the work rests on standard assumptions that accuracy on a fixed question set measures capability and that LLM-based rubric auditing is reliable after human review.

pith-pipeline@v0.9.1-grok · 6002 in / 1108 out tokens · 38078 ms · 2026-07-01T06:46:58.672170+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 12 canonical work pages · 8 internal anchors

[1]

Building Effective Agents

Anthropic. Building Effective Agents. Dec 19, 2024. https://www.anthropic.com/ engineering/building-effective-agents

2024
[2]

Introducing Contextual Retrieval

Anthropic. Introducing Contextual Retrieval. Sep 19, 2024. https://www.anthropic.com/ news/contextual-retrieval

2024
[3]

Claude for Financial Services

Anthropic. Claude for Financial Services. Jul 15, 2025. https://www.anthropic.com/ news/claude-for-financial-services

2025
[4]

Advancing Claude for Financial Services

Anthropic. Advancing Claude for Financial Services. Oct 27, 2025.https://www.anthropic. com/news/advancing-claude-for-financial-services

2025
[5]

Introducing Claude Opus 4.8

Anthropic. Introducing Claude Opus 4.8. May 28, 2026. https://www.anthropic.com/ news/claude-opus-4-8

2026
[6]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

FinQA: A Dataset of Numerical Reasoning over Financial Data

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. FinQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

2021
[8]

ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering

Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022. 11

2022
[9]

Securities and Exchange Commission

U.S. Securities and Exchange Commission. EDGAR — Electronic Data Gathering, Analysis, and Retrieval System. https://www.sec.gov/about/about-edgar
[10]

Finance agent benchmark: Benchmarking llms on real-world financial research tasks.arXiv preprint arXiv:2508.00828, 2025

Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu. Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks. arXiv preprint arXiv:2508.00828, 2025

work page arXiv 2025
[11]

Finance Agent v2 Benchmark

Vals AI. Finance Agent v2 Benchmark. https://www.vals.ai/benchmarks/fabv2
[12]

FinanceBench: A New Benchmark for Financial Question Answering

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: A New Benchmark for Financial Question Answering. arXiv preprint arXiv:2311.11944, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

FinBen: A Holistic Financial Benchmark for Large Language Models

Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, et al. FinBen: A Holistic Financial Benchmark for Large Language Models. Advances in Neural Information Processing Systems 37 (NeurIPS Datasets and Benchmarks Track), 2024

2024
[14]

FinRetrieval: A benchmark for financial data retrieval by AI agents

Eric Y . Kim and Jie Huang. FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents. arXiv preprint arXiv:2603.04403, 2026

work page arXiv 2026
[15]

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Shahriar Golchin and Mihai Surdeanu. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. International Conference on Learning Representations (ICLR), 2024

2024
[16]

Jimenez, John Yang, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

Carlos E. Jimenez, John Yang, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? International Conference on Learning Representations (ICLR), 2024

2024
[17]

Dense Passage Retrieval for Open-Domain Question Answering

Vladimir Karpukhin, Barlas O ˘guz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

2020
[18]

ColBERT: Efficient and Effective Passage Search via Con- textualized Late Interaction over BERT

Omar Khattab and Matei Zaharia. ColBERT: Efficient and Effective Passage Search via Con- textualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2020

2020
[19]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS), 2020

2020
[20]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research (TMLR), 2023

2023
[21]

IPO Underpricing

Alexander Ljungqvist. IPO Underpricing. In B. Espen Eckbo (ed.), Handbook of Corporate Finance: Empirical Corporate Finance, V olume 1, Chapter 7, Elsevier, 2007

2007
[22]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

2023
[24]

Initial Public Offerings: A Synthesis of the Literature and Directions for Future Research

Michelle Lowry, Roni Michaely, and Ekaterina V olkova. Initial Public Offerings: A Synthesis of the Literature and Directions for Future Research. Foundations and Trends in Finance, 11(3-4):154–320, 2017

2017
[25]

Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems 36 (NeurIPS), 2023. 12

2023
[26]

GAIA: A Benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A Benchmark for General AI Assistants. International Conference on Learning Representations (ICLR), 2024

2024
[27]

Kimi K2.6 is now open-weight #1 on Finance Agent Bench- mark V2

Moonshot AI (Kimi). “Kimi K2.6 is now open-weight #1 on Finance Agent Bench- mark V2.” Post on X-Twitter, May 14, 2026. https://x.com/Kimi_Moonshot/status/ 2054803169994272819

2026
[28]

MultiFinBen: A Multilingual, Multimodal, and Difficulty- Aware Benchmark for Financial LLM Evaluation

Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, et al. MultiFinBen: A Multilingual, Multimodal, and Difficulty- Aware Benchmark for Financial LLM Evaluation. arXiv preprint arXiv:2506.14028, 2025

work page arXiv 2025
[29]

Introducing GPT-5.5

OpenAI. Introducing GPT-5.5. April 23, 2026. https://openai.com/index/ introducing-gpt-5-5/

2026
[30]

Jay R. Ritter. The Long-Run Performance of Initial Public Offerings. Journal of Finance, 46(1):3–27, 1991

1991
[31]

Ritter and Ivo Welch

Jay R. Ritter and Ivo Welch. A Review of IPO Activity, Pricing, and Allocations. Journal of Finance, 57(4):1795–1828, 2002

2002
[32]

Beyond the Imitation Game: Quantify- ing and Extrapolating the Capabilities of Language Models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. Beyond the Imitation Game: Quantify- ing and Extrapolating the Capabilities of Language Models. Transactions on Machine Learning Research (TMLR), 2023

2023
[33]

Form S-1 Registration Statement

Space Exploration Technologies Corp. Form S-1 Registration Statement. U.S. Securities and Exchange Commission, May 20, 2026. https://www.sec.gov/Archives/edgar/data/ 1181412/000162828026036936/spaceexplorationtechnologi.htm

2026
[34]

VCBench: Benchmarking LLMs in Venture Capital

Rick Chen, Joseph Ternasky, Afriyie Samuel Kwesi, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Xianling Mu, Fuat Alican, and Yigit Ihlamur. VCBench: Benchmarking LLMs in Venture Capital. arXiv preprint arXiv:2509.14448, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Thomson Yen et al. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance. arXiv preprint arXiv:2605.22664, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

YC Bench: a Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches

Mostapha Benhenda. YC Bench: A Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches. arXiv preprint arXiv:2604.02378, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021

2021
[39]

Judgelm: Fine-tuned large language models are scalable judges.arXiv preprint arXiv:2310.17631, 2023

Lianghui Zhu, Xinggang Wang, and Xinlong Wang. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv preprint arXiv:2310.17631, 2023

work page arXiv 2023
[40]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023. 13

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Building Effective Agents

Anthropic. Building Effective Agents. Dec 19, 2024. https://www.anthropic.com/ engineering/building-effective-agents

2024

[2] [2]

Introducing Contextual Retrieval

Anthropic. Introducing Contextual Retrieval. Sep 19, 2024. https://www.anthropic.com/ news/contextual-retrieval

2024

[3] [3]

Claude for Financial Services

Anthropic. Claude for Financial Services. Jul 15, 2025. https://www.anthropic.com/ news/claude-for-financial-services

2025

[4] [4]

Advancing Claude for Financial Services

Anthropic. Advancing Claude for Financial Services. Oct 27, 2025.https://www.anthropic. com/news/advancing-claude-for-financial-services

2025

[5] [5]

Introducing Claude Opus 4.8

Anthropic. Introducing Claude Opus 4.8. May 28, 2026. https://www.anthropic.com/ news/claude-opus-4-8

2026

[6] [6]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[7] [7]

FinQA: A Dataset of Numerical Reasoning over Financial Data

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. FinQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

2021

[8] [8]

ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering

Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022. 11

2022

[9] [9]

Securities and Exchange Commission

U.S. Securities and Exchange Commission. EDGAR — Electronic Data Gathering, Analysis, and Retrieval System. https://www.sec.gov/about/about-edgar

[10] [10]

Finance agent benchmark: Benchmarking llms on real-world financial research tasks.arXiv preprint arXiv:2508.00828, 2025

Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu. Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks. arXiv preprint arXiv:2508.00828, 2025

work page arXiv 2025

[11] [11]

Finance Agent v2 Benchmark

Vals AI. Finance Agent v2 Benchmark. https://www.vals.ai/benchmarks/fabv2

[12] [12]

FinanceBench: A New Benchmark for Financial Question Answering

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: A New Benchmark for Financial Question Answering. arXiv preprint arXiv:2311.11944, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

FinBen: A Holistic Financial Benchmark for Large Language Models

Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, et al. FinBen: A Holistic Financial Benchmark for Large Language Models. Advances in Neural Information Processing Systems 37 (NeurIPS Datasets and Benchmarks Track), 2024

2024

[14] [14]

FinRetrieval: A benchmark for financial data retrieval by AI agents

Eric Y . Kim and Jie Huang. FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents. arXiv preprint arXiv:2603.04403, 2026

work page arXiv 2026

[15] [15]

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Shahriar Golchin and Mihai Surdeanu. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. International Conference on Learning Representations (ICLR), 2024

2024

[16] [16]

Jimenez, John Yang, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

Carlos E. Jimenez, John Yang, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? International Conference on Learning Representations (ICLR), 2024

2024

[17] [17]

Dense Passage Retrieval for Open-Domain Question Answering

Vladimir Karpukhin, Barlas O ˘guz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

2020

[18] [18]

ColBERT: Efficient and Effective Passage Search via Con- textualized Late Interaction over BERT

Omar Khattab and Matei Zaharia. ColBERT: Efficient and Effective Passage Search via Con- textualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2020

2020

[19] [19]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS), 2020

2020

[20] [20]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research (TMLR), 2023

2023

[21] [21]

IPO Underpricing

Alexander Ljungqvist. IPO Underpricing. In B. Espen Eckbo (ed.), Handbook of Corporate Finance: Empirical Corporate Finance, V olume 1, Chapter 7, Elsevier, 2007

2007

[22] [22]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

2023

[24] [24]

Initial Public Offerings: A Synthesis of the Literature and Directions for Future Research

Michelle Lowry, Roni Michaely, and Ekaterina V olkova. Initial Public Offerings: A Synthesis of the Literature and Directions for Future Research. Foundations and Trends in Finance, 11(3-4):154–320, 2017

2017

[25] [25]

Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems 36 (NeurIPS), 2023. 12

2023

[26] [26]

GAIA: A Benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A Benchmark for General AI Assistants. International Conference on Learning Representations (ICLR), 2024

2024

[27] [27]

Kimi K2.6 is now open-weight #1 on Finance Agent Bench- mark V2

Moonshot AI (Kimi). “Kimi K2.6 is now open-weight #1 on Finance Agent Bench- mark V2.” Post on X-Twitter, May 14, 2026. https://x.com/Kimi_Moonshot/status/ 2054803169994272819

2026

[28] [28]

MultiFinBen: A Multilingual, Multimodal, and Difficulty- Aware Benchmark for Financial LLM Evaluation

Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, et al. MultiFinBen: A Multilingual, Multimodal, and Difficulty- Aware Benchmark for Financial LLM Evaluation. arXiv preprint arXiv:2506.14028, 2025

work page arXiv 2025

[29] [29]

Introducing GPT-5.5

OpenAI. Introducing GPT-5.5. April 23, 2026. https://openai.com/index/ introducing-gpt-5-5/

2026

[30] [30]

Jay R. Ritter. The Long-Run Performance of Initial Public Offerings. Journal of Finance, 46(1):3–27, 1991

1991

[31] [31]

Ritter and Ivo Welch

Jay R. Ritter and Ivo Welch. A Review of IPO Activity, Pricing, and Allocations. Journal of Finance, 57(4):1795–1828, 2002

2002

[32] [32]

Beyond the Imitation Game: Quantify- ing and Extrapolating the Capabilities of Language Models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. Beyond the Imitation Game: Quantify- ing and Extrapolating the Capabilities of Language Models. Transactions on Machine Learning Research (TMLR), 2023

2023

[33] [33]

Form S-1 Registration Statement

Space Exploration Technologies Corp. Form S-1 Registration Statement. U.S. Securities and Exchange Commission, May 20, 2026. https://www.sec.gov/Archives/edgar/data/ 1181412/000162828026036936/spaceexplorationtechnologi.htm

2026

[34] [34]

VCBench: Benchmarking LLMs in Venture Capital

Rick Chen, Joseph Ternasky, Afriyie Samuel Kwesi, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Xianling Mu, Fuat Alican, and Yigit Ihlamur. VCBench: Benchmarking LLMs in Venture Capital. arXiv preprint arXiv:2509.14448, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Thomson Yen et al. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance. arXiv preprint arXiv:2605.22664, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[36] [36]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

YC Bench: a Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches

Mostapha Benhenda. YC Bench: A Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches. arXiv preprint arXiv:2604.02378, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] [38]

TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021

2021

[39] [39]

Judgelm: Fine-tuned large language models are scalable judges.arXiv preprint arXiv:2310.17631, 2023

Lianghui Zhu, Xinggang Wang, and Xinlong Wang. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv preprint arXiv:2310.17631, 2023

work page arXiv 2023

[40] [40]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023. 13

work page internal anchor Pith review Pith/arXiv arXiv 2023