
arXiv: 2512.13168 · v5 · submitted 2025-12-15 · cs.AI · cs.CE · cs.IR · cs.MA

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Pith reviewed 2026-05-16 22:39 UTC · model grok-4.3

keywords AI benchmark · finance workflows · enterprise AI · spreadsheet tasks · agent evaluation · Enron data · accounting processes · multimodal workflows

The pith

Finch introduces a benchmark of 172 real enterprise finance workflows to test AI agents on authentic spreadsheet-centric tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a benchmark called Finch for evaluating AI agents on realistic finance and accounting workflows. The workflows come from authentic enterprise sources, namely Enron files and emails plus material from other financial institutions, covering 2000 to 2025, and they interleave data entry, calculations, searches, formatting, and reporting across many spreadsheets and documents. A sympathetic reader would care because most current AI tests rely on simplified or isolated problems, whereas Finch preserves the messy, long-horizon, knowledge-intensive character of actual professional work. Under human evaluation, GPT-5.1 Pro passes only 38.4 percent of workflows while spending an average of 16.8 minutes on each.

Core claim

Finch consists of 172 composite workflows containing 384 tasks that involve 1,710 spreadsheets with 27 million cells plus PDFs and other artifacts. These were produced by an LLM-assisted mining process applied to real email threads and spreadsheet version histories, followed by over 700 hours of expert annotation to maintain fidelity to in-the-wild enterprise conditions. Human evaluations of frontier models including GPT-5.1 Pro, Claude Sonnet 4.5, and Gemini 3 Pro reveal low overall pass rates and expose specific difficulties with the interleaved and collaborative demands of the tasks.
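To make the scale figures above easier to compare, the following is a minimal sketch that turns them into per-unit averages. The constants are the numbers quoted in the text; the averages are derived here and are not reported by the paper, and the names `per_unit` and `scale` are illustrative only. Because "27 million cells" and "over 700 hours" are rounded or lower-bound figures, the results are approximate.

```python
# Figures quoted in the text above.
WORKFLOWS = 172
TASKS = 384
SPREADSHEETS = 1_710
CELLS = 27_000_000      # "27 million cells" is a rounded figure
ANNOTATION_HOURS = 700  # "over 700 hours", so this is a lower bound


def per_unit(total: float, units: int) -> float:
    """Return the simple average of `total` spread over `units`."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total / units


# Derived averages (assumption: a plain mean; the paper's own figures
# show distributions, which a mean does not capture).
scale = {
    "tasks_per_workflow": per_unit(TASKS, WORKFLOWS),
    "spreadsheets_per_workflow": per_unit(SPREADSHEETS, WORKFLOWS),
    "cells_per_spreadsheet": per_unit(CELLS, SPREADSHEETS),
    "annotation_hours_per_workflow": per_unit(ANNOTATION_HOURS, WORKFLOWS),
}

for name, value in scale.items():
    print(f"{name}: {value:,.1f}")
```

On these inputs a workflow averages a little over two tasks and roughly ten spreadsheets. These are means only; Figures 5 and 6 of the paper show the underlying distributions.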

What carries the argument

The workflow construction process that combines LLM-assisted extraction from authentic enterprise email threads and spreadsheet histories with extensive expert annotation to produce composite long-horizon tasks.

If this is right

  • Frontier AI systems spend substantial time yet complete only a minority of the realistic workflows under human evaluation.
  • The benchmark surfaces challenges in managing multimodal artifacts, cross-file retrieval, and long sequences of interleaved skills.
  • Models must improve at preserving context and performing validation across extended enterprise processes.
  • Detailed case studies identify concrete failure modes that simpler benchmarks do not reveal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar construction methods could generate benchmarks for other professional domains such as legal or supply-chain workflows.
  • Low success rates suggest that scaling current architectures may be insufficient without new mechanisms for sustained planning and collaboration.
  • Enterprises could adopt Finch-style tests to assess readiness before deploying agents in live finance operations.
  • The annotation effort provides a reusable template for creating high-fidelity evaluation data in data-intensive fields.

Load-bearing premise

The 172 workflows mined from Enron and other 2000-2025 financial institution data, after LLM-assisted extraction and expert annotation, faithfully represent the distribution and difficulty of typical enterprise finance and accounting work.

What would settle it

Extracting an independent collection of workflows from a fresh set of financial institution archives and re-running the same AI agents to determine whether the observed pass rates and failure patterns remain consistent.
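One way to judge whether a replication's pass rate is "consistent" with the original is to compare confidence intervals. The sketch below computes a Wilson score interval for a pass rate. The count of 66 passes is an assumption inferred from 38.4 percent of 172 workflows, not a number the paper reports, and the replication counts are purely hypothetical. Interval overlap is a crude heuristic, not a formal test.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Assumption: 66 passes is inferred from 38.4% of 172, not reported directly.
original = wilson_interval(66, 172)
# Hypothetical replication on a fresh set of 100 workflows.
replication = wilson_interval(40, 100)

print("original:", original)
print("replication:", replication)
print("overlap:", intervals_overlap(original, replication))
```

With only 172 workflows the interval around 38.4 percent is wide, roughly 31 to 46 percent under these assumptions, so a replication would need to land well outside that band before the original figure could be called unstable.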

Figures

Figures reproduced from arXiv: 2512.13168 by Adina Yakefu, Haoyu Dong, Mingzhe Lu, Pengkun Zhang, Shuxin Zheng, Xuanyu Dong, Yan Gao, Yilin Cheng, Zikun Zhu.

Figure 1: Real-world F&A work is messy, spanning heterogeneous and large-scale artifacts such…
Figure 2: Model pass-rate comparison on FINCH workflows. Bars show overall workflow success rates for product-side agents and API-based models. Detailed settings can be found in Section 3. We evaluate a spectrum of frontier AI systems, including Claude Sonnet 4.5, GPT 5.1, Gemini 3, Grok 4, and Qwen 3, using both expert evaluation and a novel automated evaluation pipeline that closely aligns with expert judgments.
Figure 3: Illustration of an end-to-end predictive modeling workflow typically performed by financial…
Figure 5: Distribution of number of tasks per workflow and task types across business types.
Figure 6: Distribution of the number of sheets and cells per workflow.
Figure 7: Illustration of our automated evaluation pipeline. Here,…
Figure 8: Pass rate comparison for GPT 5.1 Pro and Claude Sonnet 4.5 across different task combina…
Figure 9: For this task, the model must verify the department headcount summary by cross-checking…
Figure 10: An example of extracting data from tables and charts in PDFs and saving it to a spreadsheet.
Figure 11: Cross-sheet reference validation. This example is relatively easy for frontier AI agents.
Figure 12: This task requires deriving the XNPV5 of the contract under different combinations of…
Figure 13: The sum of A&B and the equity roll-forward test require cross-sheet retrieval and…
Figure 14: A workflow that translates a French report into English while preserving its format and…
Figure 15: Transforming a table from one structure to another requires reorganizing data and re…
Figure 16: The apparent semantics from the headers suggest a monthly/-…
Figure 17: This workflow requires creating a new spreadsheet with all values converted to USD.
Figure 18: Generating reports from tabular data requires financial knowledge of data analysis,…
Figure 19: This Excel sheet shows an assumption-update workflow, where a mix of forward contracts…
Original abstract

We introduce FinWorkBench (a.k.a. Finch) for evaluating AI agents on real-world, enterprise-grade finance and accounting workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces from Enron (15,000 files and 500,000 emails) and other financial institutions, covering the period 2000–2025 and preserving the in-the-wild messiness of multimodal artifacts such as tables and charts across diverse domains including budgeting, trading, asset management, and operational management. We propose a workflow construction process that combines LLM-assisted mining of workflows from authentic enterprise environments with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and spreadsheet version histories, and (2) meticulous annotation requiring over 700 hours of expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems, including GPT-5.1, Claude Sonnet 4.5, Claude Opus 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. Under human evaluation, GPT-5.1 Pro spends an average of 16.8 minutes per workflow yet passes only 38.4% of workflows. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FinWorkBench (Finch), a benchmark of 172 composite workflows (384 tasks) mined from Enron (15k files, 500k emails) and other 2000-2025 institutional data. Workflows are built via LLM-assisted extraction plus >700 hours of expert annotation, spanning 1,710 spreadsheets (27M cells) plus PDFs and artifacts; they interleave data entry, cross-file retrieval, calculation, modeling, visualization, and reporting. Human evaluations show that frontier models (GPT-5.1, Claude Sonnet/Opus 4.5, Gemini 3 Pro, etc.) achieve low pass rates, e.g., GPT-5.1 Pro at 38.4% despite averaging 16.8 min per workflow.

Significance. If the workflows are shown to be representative of typical enterprise distributions, Finch would provide a useful, high-fidelity testbed that exposes gaps in current agents on long-horizon, multimodal, knowledge-intensive finance tasks that existing synthetic benchmarks do not capture.

major comments (2)
  1. §3 (Workflow Construction): No inter-annotator agreement statistics, sampling frame, or external validation against industry task distributions are reported for the LLM-assisted extraction and 700-hour expert annotation process. This directly weakens the central claim that the 172 workflows faithfully represent the difficulty and mix of real enterprise finance/accounting work, especially given documented irregularities in the Enron corpus.
  2. §5 (Experiments): The reported pass rates (e.g., GPT-5.1 Pro at 38.4%) are presented only in aggregate; no per-component breakdown (calculation vs. retrieval vs. validation) or comparison to simpler baselines is given, limiting the ability to diagnose load-bearing failure modes.
minor comments (2)
  1. Abstract and §1: The model name 'GPT-5.1' should be explicitly defined or footnoted in the main text to avoid ambiguity about whether it refers to a released or hypothetical system.
  2. Related Work: A short comparison table or paragraph contrasting Finch with prior spreadsheet benchmarks (formula prediction, table QA) would clarify the added value of the interleaved, multi-artifact workflow setting.
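Major comment 1 asks for inter-annotator agreement statistics. As an illustration of what such a statistic could look like, here is a minimal sketch of Cohen's kappa for two annotators who independently label the same items. The label lists are hypothetical and do not come from the paper, which (per the rebuttal) used sequential rather than parallel annotation; `cohens_kappa` is an illustrative helper, not part of any Finch tooling.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled at random with their own
    # marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgments from two independent annotators.
annotator_1 = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
annotator_2 = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the annotators agree on 6 of 8 items while chance agreement is 0.5, giving a kappa of 0.5. Computing it on Finch would require a doubly annotated subset, which the rebuttal states was not collected.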

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on workflow construction and experimental reporting. We address each point below and will revise the manuscript to provide greater transparency on the annotation process and more granular experimental breakdowns.

  1. Referee: §3 (Workflow Construction): No inter-annotator agreement statistics, sampling frame, or external validation against industry task distributions are reported for the LLM-assisted extraction and 700-hour expert annotation process. This directly weakens the central claim that the 172 workflows faithfully represent the difficulty and mix of real enterprise finance/accounting work, especially given documented irregularities in the Enron corpus.

    Authors: We agree that additional details on the construction process would strengthen the representativeness claim. The workflows were derived via LLM-assisted extraction from the Enron corpus and other 2000-2025 institutional sources, followed by sequential expert verification exceeding 700 hours. We will revise §3 to include a full description of the sampling frame, the annotation protocol with quality-control steps, and explicit discussion of Enron irregularities along with our mitigation via expert curation. Formal inter-annotator agreement statistics are not available because the process used sequential rather than parallel independent annotations; we will note this limitation. External validation against comprehensive industry distributions is not feasible with available data, but we will add a discussion of domain coverage across budgeting, trading, asset management, and operational management to support the claim of enterprise fidelity.
    revision: partial

  2. Referee: §5 (Experiments): The reported pass rates (e.g., GPT-5.1 Pro at 38.4%) are presented only in aggregate; no per-component breakdown (calculation vs. retrieval vs. validation) or comparison to simpler baselines is given, limiting the ability to diagnose load-bearing failure modes.

    Authors: We agree that aggregate results limit diagnostic value. In the revision we will add a per-component breakdown of success rates across the 384 tasks, separating performance on data entry, cross-file retrieval, calculation/modeling, validation, visualization, and reporting. We will also include comparisons against simpler baselines such as single-task LLMs and basic retrieval agents to isolate the impact of long-horizon composition. These additions, including updated tables, will be incorporated into the revised §5.
    revision: yes
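The per-component breakdown discussed in response 2 amounts to grouping task outcomes by task type and computing a pass rate per group. The sketch below shows one way to do that. The `(component, passed)` record format and the sample data are assumptions made for illustration; the paper's actual result schema is not described on this page.

```python
from collections import defaultdict


def pass_rate_by_component(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Group (component, passed) records and return the pass rate per component."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for component, passed in results:
        totals[component] += 1
        passes[component] += int(passed)
    return {component: passes[component] / totals[component] for component in totals}


# Hypothetical task-level outcomes; not data from the paper.
sample = [
    ("calculation", True),
    ("calculation", False),
    ("retrieval", True),
    ("retrieval", True),
    ("validation", False),
    ("validation", False),
    ("validation", True),
]

rates = pass_rate_by_component(sample)
for component, rate in sorted(rates.items()):
    print(f"{component}: {rate:.1%}")
```

A breakdown like this only diagnoses failure modes if each task carries a single component label; tasks that interleave several skills would need multi-label handling, which this sketch does not attempt.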

standing simulated objections not resolved
  • Formal inter-annotator agreement statistics and external validation against industry task distributions cannot be provided because they were not collected during the sequential annotation process and no such reference datasets were available.

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction from external sources


The paper introduces FinWorkBench by mining 172 workflows from external Enron (15k files) and other 2000-2025 institutional data using LLM-assisted extraction followed by 700+ hours of expert annotation. No equations, parameter fitting, predictions, or derivations are present; the central claim is the benchmark's creation and evaluation on frontier models. All load-bearing steps rely on external data and human annotation rather than self-definition or self-citation chains, making the work self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that the chosen enterprise artifacts and expert annotation process produce representative workflows; no free parameters, new entities, or mathematical axioms are introduced.

axioms (1)
  • domain assumption: Real-world enterprise data from Enron and similar institutions, after expert annotation, accurately reflects typical finance and accounting workflows.
    This underpins the claim that Finch captures authentic enterprise challenges.

pith-pipeline@v0.9.0 · 5642 in / 1305 out tokens · 36925 ms · 2026-05-16T22:39:59.448024+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

    cs.AI 2026-05 unverdicted novelty 7.0

    WorkstreamBench evaluates LLM agents on end-to-end financial spreadsheet creation and finds that even top models like Claude fall short of professional standards, with performance dropping sharply on complex tasks.

  2. Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

    cs.MA 2026-05 unverdicted novelty 7.0

    EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 2 Pith papers · 4 internal anchors

  1. [1]

    Shortcut ai for spreadsheets.https://www.tryshortcut.ai/, 2024

    Shortcut AI. Shortcut ai for spreadsheets.https://www.tryshortcut.ai/, 2024

  2. [2]

    Claude for excel.https://claude.com/claude-for-excel, 2025

    Anthropic. Claude for excel.https://claude.com/claude-for-excel, 2025

  3. [3]

    Introducing claude sonnet 4.5

    Anthropic. Introducing claude sonnet 4.5. https://www.anthropic.com/news/claude-s onnet-4-5, 2025. Accessed: 2025-12-14

  4. [4]

    World Bank, Washington, DC, 2024

    World Bank.International Debt Report 2024. World Bank, Washington, DC, 2024. World Bank’s annual publication on external debt statistics

  5. [5]

    Finance agent benchmark: Benchmarking llms on real-world financial research tasks.arXiv preprint arXiv:2508.00828, 2025

    Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu. Finance agent benchmark: Benchmarking llms on real-world financial research tasks.arXiv preprint arXiv:2508.00828, 2025

  6. [6]

    Auto-formula: Recommend formulas in spreadsheets using contrastive learning for table representations.Proceedings of the ACM on Management of Data, 2(3):1–27, 2024

    Sibei Chen, Yeye He, Weiwei Cui, Ju Fan, Song Ge, Haidong Zhang, Dongmei Zhang, and Surajit Chaudhuri. Auto-formula: Recommend formulas in spreadsheets using contrastive learning for table representations.Proceedings of the ACM on Management of Data, 2(3):1–27, 2024

  7. [7]

    Sheetagent: towards a generalist agent for spreadsheet reasoning and manipulation via large language models

    Yibin Chen, Yifu Yuan, Zeyu Zhang, Yan Zheng, Jinyi Liu, Fei Ni, Jianye Hao, Hangyu Mao, and Fuzheng Zhang. Sheetagent: towards a generalist agent for spreadsheet reasoning and manipulation via large language models. InProceedings of the ACM on Web Conference 2025, pages 158–177, 2025

  8. [8]

    Finqa: A dataset of numerical reasoning over financial data

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. Finqa: A dataset of numerical reasoning over financial data. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, 2021

  9. [9]

    Fortap: Using formulas for numerical-reasoning-aware table pretraining

    Zhoujun Cheng, Haoyu Dong, Ran Jia, Pengfei Wu, Shi Han, Fan Cheng, and Dongmei Zhang. Fortap: Using formulas for numerical-reasoning-aware table pretraining. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1150–1166, 2022

  10. [10]

    Finagentbench: A benchmark dataset for agentic retrieval in financial question answering

    Chanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira, Chaewoon Kim, Minjae Kim, Juneha Hwang, Jaeseon Ha, Hojun Choi, Suyeol Yun, Yongjin Kim, et al. Finagentbench: A benchmark dataset for agentic retrieval in financial question answering. InProceedings of the 6th ACM International Conference on AI in Finance, pages 632–637, 2025

  11. [11]

    Fiscal reference tables, november 2025

    Department of Finance Canada. Fiscal reference tables, november 2025. Technical report, Government of Canada, Ottawa, Canada, 2025. Provides annual data on the financial position of the federal, provincial-territorial and local governments

  12. [12]

    Reasoning and retrieval for complex semi-structured tables via reinforced relational data transformation

    Haoyu Dong, Yue Hu, and Yanan Cao. Reasoning and retrieval for complex semi-structured tables via reinforced relational data transformation. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1382– 1391, 2025

  13. [13]

    Tablesense: Spreadsheet table detection with convolutional neural networks

    Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, and Dongmei Zhang. Tablesense: Spreadsheet table detection with convolutional neural networks. InProceedings of the AAAI conference on artificial intelligence, volume 33, pages 69–76, 2019

  14. [14]

    Ttc-quali: A text-table-chart dataset for multimodal quantity alignment

    Haoyu Dong, Haochen Wang, Anda Zhou, and Yue Hu. Ttc-quali: A text-table-chart dataset for multimodal quantity alignment. InProceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 181–189, 2024

  15. [15]

    Neural formatting for spreadsheet tables

    Haoyu Dong, Jinyu Wang, Zhouyu Fu, Shi Han, and Dongmei Zhang. Neural formatting for spreadsheet tables. InProceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 305–314, 2020

  16. [16]

    MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

    Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, and Guolin Ke. Machinelearn- inglm: Scaling many-shot in-context learning via continued pretraining.arXiv preprint arXiv:2509.06806, 2025. 13

  17. [17]

    Spreadsheetllm: Encodi ng spreadsheets for large language models

    Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Junyu Xiong, Shiyu Xia, Mengyu Zhou, Yun Lin, José Cambronero, Yeye He, Shi Han, et al. Spreadsheetllm: encoding spreadsheets for large language models.arXiv preprint arXiv:2407.09025, 2024

  18. [18]

    Dabstep: Data agent benchmark for multi-step reasoning.arXiv preprint arXiv:2506.23719, 2025

    Alex Egg, Martin Iglesias Goyanes, Friso Kingma, Andreu Mora, Leandro von Werra, and Thomas Wolf. Dabstep: Data agent benchmark for multi-step reasoning.arXiv preprint arXiv:2506.23719, 2025

  19. [19]

    EnronData.org

    EnronData.org. Edo enron email pst dataset. https://enrondata.readthedocs.io/e n/latest/data/edo-enron-email-pst-dataset/ . Creative Commons Attribution 3.0 United States License. To provide attribution, please cite to “EnronData.org.”

  20. [20]

    The euses spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms

    Marc Fisher and Gregg Rothermel. The euses spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. InProceedings of the first workshop on End-user software engineering, pages 1–5, 2005

  21. [21]

    Gemini for google workspace.https://workspace.google.com/solutions/ai/, 2024

    Google. Gemini for google workspace.https://workspace.google.com/solutions/ai/, 2024

  22. [22]

    Gemini 3 pro

    Google DeepMind. Gemini 3 pro. https://deepmind.google/models/gemini/pro/ ,

  23. [23]

    Accessed: 2025-12-14

  24. [24]

    Hermes: Interactive spreadsheet formula prediction via hierarchical formulet expansion

    Wanrong He, Haoyu Dong, Yihuai Gao, Zhichao Fan, Xingzhuo Guo, Zhitao Hou, Xiao Lv, Ran Jia, Shi Han, and Dongmei Zhang. Hermes: Interactive spreadsheet formula prediction via hierarchical formulet expansion. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8356–8372, 2023

  25. [25]

    Public expenditure statistical analyses 2023

    HM Treasury. Public expenditure statistical analyses 2023. Technical report, HM Treasury, London, United Kingdom, 2023. UK public expenditure statistical release (PESA)

  26. [26]

    Finsearchcomp: Towards a realistic, expert-level evaluation of financial search and reasoning.arXiv preprint arXiv:2509.13160, 2025

    Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, et al. Finsearchcomp: Towards a realistic, expert-level evaluation of financial search and reasoning.arXiv preprint arXiv:2509.13160, 2025

  27. [27]

    Sodbench: A large language model approach to documenting spreadsheet operations.arXiv preprint arXiv:2510.19864, 2025

    Amila Indika and Igor Molybog. Sodbench: A large language model approach to documenting spreadsheet operations.arXiv preprint arXiv:2510.19864, 2025

  28. [28]

    FinanceBench: A New Benchmark for Financial Question Answering

    Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. Financebench: A new benchmark for financial question answering.arXiv preprint arXiv:2311.11944, 2023

  29. [29]

    The enron corpus: A new dataset for email classification research

    Bryan Klimt and Yiming Yang. The enron corpus: A new dataset for email classification research. InEuropean conference on machine learning, pages 217–226. Springer, 2004

  30. [30]

    Investorbench: A benchmark for financial decision-making tasks with llm-based agent

    Haohang Li, Yupeng Cao, Yangyang Yu, Shashidhar Reddy Javaji, Zhiyang Deng, Yueru He, Yuechen Jiang, Zining Zhu, Kp Subbalakshmi, Jimin Huang, et al. Investorbench: A benchmark for financial decision-making tasks with llm-based agent. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2...

  31. [31]

    Sheetcopilot: Bringing software productivity to the next level through large language models.Advances in Neural Information Processing Systems, 36:4952–4984, 2023

    Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and Zhao-Xiang Zhang. Sheetcopilot: Bringing software productivity to the next level through large language models.Advances in Neural Information Processing Systems, 36:4952–4984, 2023

  32. [32]

    Tapilot-crossing: Benchmarking and evolving llms towards interactive data analysis agents.arXiv preprint arXiv:2403.05307, 2024

    Jinyang Li, Nan Huo, Yan Gao, Jiayi Shi, Yingxiu Zhao, Ge Qu, Yurong Wu, Chenhao Ma, Jian-Guang Lou, and Reynold Cheng. Tapilot-crossing: Benchmarking and evolving llms towards interactive data analysis agents.arXiv preprint arXiv:2403.05307, 2024

  33. [33]

    Tablebank: Table benchmark for image-based table detection and recognition

    Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: Table benchmark for image-based table detection and recognition. InProceedings of the Twelfth Language Resources and Evaluation Conference, pages 1918–1925, 2020

  34. [34]

    Auto-tables: Synthesizing multi-step transformations to relationalize tables without using examples.Proceedings of the VLDB Endowment, 16(11):3391–3403, 2023

    Peng Li, Yeye He, Cong Yan, Yue Wang, and Surajit Chaudhuri. Auto-tables: Synthesizing multi-step transformations to relationalize tables without using examples.Proceedings of the VLDB Endowment, 16(11):3391–3403, 2023. 14

  35. [35]

    Mimotable: A multi-scale spreadsheet benchmark with meta operations for table reasoning

    Zheng Li, Yang Du, Mao Zheng, and Mingyang Song. Mimotable: A multi-scale spreadsheet benchmark with meta operations for table reasoning. InProceedings of the 31st International Conference on Computational Linguistics, pages 2548–2560, 2025

  36. [36]

    Findabench: Benchmarking financial data analysis ability of large language models

    Shu Liu, Shangqing Zhao, Chenghao Jia, Xinlin Zhuang, Zhaoguang Long, Jie Zhou, Aimin Zhou, Man Lan, and Yang Chong. Findabench: Benchmarking financial data analysis ability of large language models. InProceedings of the 31st International Conference on Computational Linguistics, pages 710–725, 2025

  37. [37]

    Visfineval: A scenario-driven chinese multimodal benchmark for holistic financial understanding

    Zhaowei Liu, Xin Guo, Haotian Xia, Lingfeng Zeng, Fangqi Lou, Jinyi Niu, Mengping Li, Qi Qi, Jiahuan Li, Wei Zhang, et al. Visfineval: A scenario-driven chinese multimodal benchmark for holistic financial understanding. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24099–24157, 2025

  38. [38]

    Spreadsheetbench: Towards challenging real world spreadsheet manipulation.Advances in Neural Information Processing Systems, 37:94871–94908, 2024

    Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, and Jie Tang. Spreadsheetbench: Towards challenging real world spreadsheet manipulation.Advances in Neural Information Processing Systems, 37:94871–94908, 2024

  39. [39]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

  40. [40]

    Microsoft 365 copilot

    Microsoft. Microsoft 365 copilot. https://www.microsoft.com/en-us/microsoft-365 /copilot, 2024

  41. [41]

    OpenAI. Gpt-5. https://openai.com/index/introducing-gpt-5/ , 2025. Accessed: 2025-12-14

  42. [42]

    Introducing chatgpt agent

    OpenAI. Introducing chatgpt agent. https://openai.com/index/introducing-chatgpt -agent/, 2025

  43. [43]

    GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

    Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, et al. Gdpval: Evaluating ai model performance on real-world economically valuable tasks.arXiv preprint arXiv:2510.04374, 2025

  44. [44]

    arXiv preprint arXiv:2508.09124, 2025

    Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, and Saravan Rajmohan. Odysseybench: Evaluating llm agents on long-horizon complex office application workflows.arXiv preprint arXiv:2508.09124, 2025

  45. [45]

    FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

    Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, et al. Finauditing: A financial taxonomy-structured multi-document benchmark for evaluating llms.arXiv preprint arXiv:2510.08886, 2025

  46. [46]

    Officebench: Benchmarking language agents across multiple applications for office automation.arXiv preprint arXiv:2407.19056, 2024

    Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, and Jingbo Shang. Officebench: Benchmarking language agents across multiple applications for office automation.arXiv preprint arXiv:2407.19056, 2024

  47. [47]

    Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024

  48. [48]

    Widesearch: Benchmarking agentic broad info-seeking.arXiv preprint arXiv:2508.07999, 2025

    Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, et al. Widesearch: Benchmarking agentic broad info-seeking.arXiv preprint arXiv:2508.07999, 2025

  49. [49]

    Realhitbench: A comprehensive realistic hierarchical table benchmark for evaluating llm-based table analysis

    Pengzuo Wu, Yuhang Yang, Guangcheng Zhu, Chao Ye, Hong Gu, Xu Lu, Ruixuan Xiao, Bowen Bao, Yijing He, Liangyu Zha, et al. Realhitbench: A comprehensive realistic hierarchical table benchmark for evaluating llm-based table analysis. arXiv preprint arXiv:2506.13405, 2025

  50. [50]

    Grok 4

    xAI. Grok 4. https://x.ai/news/grok-4, 2025

  51. [51]

    Vision language models for spreadsheet understanding: Challenges and opportunities

    Shiyu Xia, Junyu Xiong, Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Mengyu Zhou, Yeye He, Shi Han, and Dongmei Zhang. Vision language models for spreadsheet understanding: Challenges and opportunities. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 116–128, 2024

  52. [53]

    Finben: A holistic financial benchmark for large language models

    Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, et al. Finben: A holistic financial benchmark for large language models. Advances in Neural Information Processing Systems, 37:95716–95743, 2024

  53. [54]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  54. [55]

    Xfinbench: Benchmarking llms in complex financial problem solving and reasoning

    Zhihan Zhang, Yixin Cao, and Lizi Liao. Xfinbench: Benchmarking llms in complex financial problem solving and reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8715–8758, 2025

  55. [56]

    Nl2formula: Generating spreadsheet formulas from natural language queries

    Wei Zhao, Zhitao Hou, Siyuan Wu, Yan Gao, Haoyu Dong, Yao Wan, Hongyu Zhang, Yulei Sui, and Haidong Zhang. Nl2formula: Generating spreadsheet formulas from natural language queries. In Findings of the Association for Computational Linguistics: EACL 2024, pages 2377–2388, 2024

  56. [57]

    Multimodal table understanding

    Mingyu Zheng, Xinwei Feng, Qingyi Si, Qiaoqiao She, Zheng Lin, Wenbin Jiang, and Weiping Wang. Multimodal table understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9102–9124, 2024

  57. [58]

    Sheetmind: An end-to-end llm-powered multi-agent framework for spreadsheet automation

    Ruiyan Zhu, Xi Cheng, Ke Liu, Brian Zhu, Daniel Jin, Neeraj Parihar, Zhoutian Xu, and Oliver Gao. Sheetmind: An end-to-end llm-powered multi-agent framework for spreadsheet automation. arXiv preprint arXiv:2506.12339, 2025

  58. [59]

    A role definition: “You are an expert who can manipulate spreadsheets through Python code.”

  59. [60]

    A detailed description of the compact spreadsheet encoding format with illustrative examples

  60. [61]

    The task instruction and explicit input/output file paths

  61. [62]

    Library-specific best practices (e.g., openpyxl chart creation patterns) to mitigate common code errors

  62. [63]

    An explicit directive to generate Python code as the final output.

    This structured design explicitly guides models toward generating valid, context-aligned Python code, minimizing ambiguity in task interpretation. However, for models that support reasoning traces (GPT 5.1, Gemini 3 Pro), we request explicit reasoning via the include_reasoning API param...
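    The five prompt components listed above (role definition, encoding-format description, task instruction with file paths, library-specific best practices, and the code-only directive) could be assembled roughly as in the sketch below. This is an illustrative reconstruction, not the authors' implementation: the names `PromptParts` and `build_prompt`, the section headings, the wording of `DIRECTIVE`, and all example values are assumptions. Only the role sentence is quoted from the paper.

```python
from dataclasses import dataclass

# Role sentence as quoted in the paper.
ROLE = "You are an expert who can manipulate spreadsheets through Python code."
# Hypothetical wording; the paper only says an explicit code directive is given.
DIRECTIVE = "Return only Python code as the final output."


@dataclass
class PromptParts:
    """Hypothetical container for the task-specific prompt components."""
    encoding_description: str  # compact spreadsheet encoding format + examples
    task_instruction: str      # natural-language task
    input_path: str            # explicit input file path
    output_path: str           # explicit output file path
    library_tips: str          # e.g. openpyxl chart creation patterns


def build_prompt(parts: PromptParts) -> str:
    """Join the five components in the order they are listed in the text."""
    sections = [
        ROLE,
        "Spreadsheet encoding format:\n" + parts.encoding_description,
        "Task:\n" + parts.task_instruction
        + "\nInput file: " + parts.input_path
        + "\nOutput file: " + parts.output_path,
        "Library best practices:\n" + parts.library_tips,
        DIRECTIVE,
    ]
    return "\n\n".join(sections)


# Example values are placeholders, not taken from the benchmark.
prompt = build_prompt(PromptParts(
    encoding_description="Each sheet is listed as address/value pairs.",
    task_instruction="Add a total row beneath the revenue column.",
    input_path="input/workbook.xlsx",
    output_path="output/workbook.xlsx",
    library_tips="Save the workbook after every modification.",
))
print(prompt)
```

    Keeping the fixed pieces (role and directive) as constants and the task-specific pieces in a small record makes the ordering of the components explicit and easy to check.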