Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Pith reviewed 2026-05-16 22:39 UTC · model grok-4.3
The pith
Finch introduces a benchmark of 172 real enterprise finance workflows to test AI agents on authentic spreadsheet-centric tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Finch consists of 172 composite workflows containing 384 tasks that involve 1,710 spreadsheets with 27 million cells plus PDFs and other artifacts. These were produced by an LLM-assisted mining process applied to real email threads and spreadsheet version histories, followed by over 700 hours of expert annotation to maintain fidelity to in-the-wild enterprise conditions. Human evaluations of frontier models including GPT-5.1 Pro, Claude Sonnet 4.5, and Gemini 3 Pro reveal low overall pass rates and expose specific difficulties with the interleaved and collaborative demands of the tasks.
What carries the argument
The workflow construction process that combines LLM-assisted extraction from authentic enterprise email threads and spreadsheet histories with extensive expert annotation to produce composite long-horizon tasks.
If this is right
- Under human evaluation, GPT-5.1 Pro averages 16.8 minutes per workflow yet passes only 38.4% of workflows, so even frontier systems complete only a minority of these realistic workflows.
- The benchmark surfaces challenges in managing multimodal artifacts, cross-file retrieval, and long sequences of interleaved skills.
- Models must improve at preserving context and performing validation across extended enterprise processes.
- Detailed case studies identify concrete failure modes that simpler benchmarks do not reveal.
Where Pith is reading between the lines
- Similar construction methods could generate benchmarks for other professional domains such as legal or supply-chain workflows.
- Low success rates suggest that scaling current architectures may be insufficient without new mechanisms for sustained planning and collaboration.
- Enterprises could adopt Finch-style tests to assess readiness before deploying agents in live finance operations.
- The annotation effort provides a reusable template for creating high-fidelity evaluation data in data-intensive fields.
Load-bearing premise
The 172 workflows mined from Enron and other 2000-2025 financial institution data, after LLM-assisted extraction and expert annotation, faithfully represent the distribution and difficulty of typical enterprise finance and accounting work.
What would settle it
Extracting an independent collection of workflows from a fresh set of financial institution archives and re-running the same AI agents to determine whether the observed pass rates and failure patterns remain consistent.
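A replication of this kind would need some way to judge whether two pass rates are "consistent". The sketch below is one possible approach, not something the paper reports: it computes a Wilson score interval for a single pass rate and a pooled two-proportion z statistic for comparing two runs. The count of 66 passes out of 172 is an assumption obtained by rounding 38.4% of 172 (the human evaluation may have covered a different number of workflows), and the replication counts are purely hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def two_proportion_z(passes_a: int, n_a: int, passes_b: int, n_b: int) -> float:
    """Pooled z statistic for the difference between two pass rates."""
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (passes_a / n_a - passes_b / n_b) / se


# ASSUMPTION: 66/172 approximates the reported 38.4% pass rate.
original = (66, 172)
# HYPOTHETICAL replication outcome on a fresh set of workflows.
replication = (60, 172)

low, high = wilson_interval(*original)
z_stat = two_proportion_z(*original, *replication)
print(f"original pass rate interval: [{low:.3f}, {high:.3f}]")
print(f"z statistic vs. replication: {z_stat:.2f}")
```

A |z| well below 1.96 would be compatible with the two collections yielding the same underlying pass rate; with only 172 workflows per side, the interval is wide, so moderate differences would be hard to detect.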
Original abstract
We introduce FinWorkBench (a.k.a. Finch) for evaluating AI agents on real-world, enterprise-grade finance and accounting workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces from Enron (15,000 files and 500,000 emails) and other financial institutions, covering the period 2000-2025 and preserving the in-the-wild messiness of multimodal artifacts such as tables and charts across diverse domains including budgeting, trading, asset management, and operational management. We propose a workflow construction process that combines LLM-assisted mining of workflows from authentic enterprise environments with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and spreadsheet version histories, and (2) meticulous annotation requiring over 700 hours of expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems, including GPT-5.1, Claude Sonnet 4.5, Claude Opus 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. Under human evaluation, GPT-5.1 Pro spends an average of 16.8 minutes per workflow yet passes only 38.4% of workflows. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FinWorkBench (Finch), a benchmark of 172 composite workflows (384 tasks) mined from Enron (15k files, 500k emails) and other 2000-2025 institutional data. Workflows are built via LLM-assisted extraction plus >700 hours of expert annotation, spanning 1,710 spreadsheets (27M cells) plus PDFs and other artifacts; they interleave data entry, cross-file retrieval, calculation, modeling, visualization, and reporting. Human evaluations show frontier models (GPT-5.1, Claude Sonnet/Opus 4.5, Gemini 3 Pro, etc.) achieve low pass rates, e.g., GPT-5.1 Pro at 38.4% despite averaging 16.8 min per workflow.
Significance. If the workflows are shown to be representative of typical enterprise distributions, Finch would provide a useful, high-fidelity testbed that exposes gaps in current agents on long-horizon, multimodal, knowledge-intensive finance tasks that existing synthetic benchmarks do not capture.
major comments (2)
- [§3 (Workflow Construction)] No inter-annotator agreement statistics, sampling frame, or external validation against industry task distributions are reported for the LLM-assisted extraction and 700-hour expert annotation process. This directly weakens the central claim that the 172 workflows faithfully represent the difficulty and mix of real enterprise finance/accounting work, especially given documented irregularities in the Enron corpus.
- [§5 (Experiments)] The reported pass rates (e.g., GPT-5.1 Pro at 38.4%) are presented only in aggregate; no per-component breakdown (calculation vs. retrieval vs. validation) or comparison to simpler baselines is given, limiting the ability to diagnose load-bearing failure modes.
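The per-component breakdown requested in the second comment could be computed along the following lines. This is a minimal sketch under stated assumptions: the record layout (`skills`, `passed`), the task identifiers, and the sample outcomes are all hypothetical and are not drawn from the paper's data; only the skill names echo categories the manuscript mentions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def pass_rate_by_skill(results: Iterable[dict]) -> Dict[str, float]:
    """Aggregate pass rates per skill tag.

    Each result is a dict with 'skills' (list of str) and 'passed' (bool).
    A task counts once toward every distinct skill it is tagged with.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in results:
        for skill in set(record["skills"]):
            totals[skill] += 1
            passes[skill] += int(record["passed"])
    return {skill: passes[skill] / totals[skill] for skill in sorted(totals)}


# HYPOTHETICAL task-level outcomes, for illustration only.
sample_results: List[dict] = [
    {"task_id": "t1", "skills": ["calculation", "cross-file retrieval"], "passed": True},
    {"task_id": "t2", "skills": ["calculation", "validation"], "passed": False},
    {"task_id": "t3", "skills": ["cross-file retrieval"], "passed": False},
    {"task_id": "t4", "skills": ["validation", "reporting"], "passed": True},
]

for skill, rate in pass_rate_by_skill(sample_results).items():
    print(f"{skill}: {rate:.0%}")
```

Because a failed task lowers the rate of every skill it is tagged with, this attribution is coarse; isolating which skill actually caused a failure would need step-level annotations.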
minor comments (2)
- [Abstract] Abstract and §1: The model name 'GPT-5.1' should be explicitly defined or footnoted in the main text to avoid ambiguity about whether it refers to a released or hypothetical system.
- [Related Work] Related Work: A short comparison table or paragraph contrasting Finch with prior spreadsheet benchmarks (formula prediction, table QA) would clarify the added value of the interleaved, multi-artifact workflow setting.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on workflow construction and experimental reporting. We address each point below and will revise the manuscript to provide greater transparency on the annotation process and more granular experimental breakdowns.
Point-by-point responses
- Referee: [§3 (Workflow Construction)] No inter-annotator agreement statistics, sampling frame, or external validation against industry task distributions are reported for the LLM-assisted extraction and 700-hour expert annotation process. This directly weakens the central claim that the 172 workflows faithfully represent the difficulty and mix of real enterprise finance/accounting work, especially given documented irregularities in the Enron corpus.
  Authors: We agree that additional details on the construction process would strengthen the representativeness claim. The workflows were derived via LLM-assisted extraction from the Enron corpus and other 2000-2025 institutional sources, followed by sequential expert verification exceeding 700 hours. We will revise §3 to include a full description of the sampling frame, the annotation protocol with quality-control steps, and explicit discussion of Enron irregularities along with our mitigation via expert curation. Formal inter-annotator agreement statistics are not available because the process used sequential rather than parallel independent annotations; we will note this limitation. External validation against comprehensive industry distributions is not feasible with available data, but we will add a discussion of domain coverage across budgeting, trading, asset management, and operational management to support the claim of enterprise fidelity.
  revision: partial
- Referee: [§5 (Experiments)] The reported pass rates (e.g., GPT-5.1 Pro at 38.4%) are presented only in aggregate; no per-component breakdown (calculation vs. retrieval vs. validation) or comparison to simpler baselines is given, limiting the ability to diagnose load-bearing failure modes.
  Authors: We agree that aggregate results limit diagnostic value. In the revision we will add a per-component breakdown of success rates across the 384 tasks, separating performance on data entry, cross-file retrieval, calculation/modeling, validation, visualization, and reporting. We will also include comparisons against simpler baselines such as single-task LLMs and basic retrieval agents to isolate the impact of long-horizon composition. These additions, including updated tables, will be incorporated into the revised §5.
  revision: yes
- Formal inter-annotator agreement statistics and external validation against industry task distributions cannot be provided because they were not collected during the sequential annotation process and no such reference datasets were available.
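For context on what such a statistic would involve: if a subset of workflows were labelled independently by two annotators, agreement could be summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only; the labels are hypothetical, and nothing here reflects annotations actually collected for Finch.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / n**2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# HYPOTHETICAL verdicts from two annotators on six workflows.
annotator_1 = ["pass", "pass", "fail", "fail", "pass", "fail"]
annotator_2 = ["pass", "fail", "fail", "fail", "pass", "pass"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.3f}")
```

Kappa requires parallel, independent labels, which is why the sequential verification process described by the authors cannot produce it retroactively.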
Circularity Check
No circularity: empirical benchmark construction from external sources
The paper introduces FinWorkBench by mining 172 workflows from external Enron (15k files) and other 2000-2025 institutional data using LLM-assisted extraction followed by 700+ hours of expert annotation. No equations, parameter fitting, predictions, or derivations are present; the central claim is the benchmark's creation and evaluation on frontier models. All load-bearing steps rely on external data and human annotation rather than self-definition or self-citation chains, making the work self-contained as an empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Real-world enterprise data from Enron and similar institutions, after expert annotation, accurately reflects typical finance and accounting workflows.
Forward citations
Cited by 2 Pith papers
- WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
  WorkstreamBench evaluates LLM agents on end-to-end financial spreadsheet creation and finds that even top models like Claude fall short of professional standards, with performance dropping sharply on complex tasks.
- Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows
  EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...
- A role definition: “You are an expert who can manipulate spreadsheets through Python code.”
- A detailed description of the compact spreadsheet encoding format with illustrative examples
- The task instruction and explicit input/output file paths
- Library-specific best practices (e.g., openpyxl chart creation patterns) to mitigate common code errors
- An explicit directive to generate Python code as the final output.
This structured design explicitly guides models toward generating valid, context-aligned Python code, minimizing ambiguity in task interpretation. However, for models that support reasoning traces (GPT 5.1, Gemini 3 Pro), we request explicit reasoning via the include_reasoning API param...
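The five prompt components listed above could be assembled roughly as follows. This is a sketch under stated assumptions rather than the authors' actual prompt: only the role sentence is quoted from the source, while the function name `build_prompt`, the section labels, the closing directive wording, and every example value are hypothetical.

```python
from typing import List

# Quoted from the source; everything else in this sketch is an assumption.
ROLE = "You are an expert who can manipulate spreadsheets through Python code."


def build_prompt(
    encoding_description: str,
    task_instruction: str,
    input_path: str,
    output_path: str,
    best_practices: List[str],
) -> str:
    """Join the five components into a single prompt string."""
    sections = [
        ROLE,
        "Spreadsheet encoding format:\n" + encoding_description.strip(),
        "Task:\n"
        + task_instruction.strip()
        + f"\nInput file: {input_path}\nOutput file: {output_path}",
        "Library best practices:\n"
        + "\n".join(f"- {tip}" for tip in best_practices),
        # Hypothetical wording for the final-output directive.
        "Return only Python code as the final output.",
    ]
    return "\n\n".join(sections)


# HYPOTHETICAL inputs, for illustration only.
example_prompt = build_prompt(
    encoding_description="Each line encodes one cell as ADDRESS|VALUE.",
    task_instruction="Add a column with quarterly totals.",
    input_path="input/budget.xlsx",
    output_path="output/budget_updated.xlsx",
    best_practices=["Create charts from explicit cell ranges."],
)
print(example_prompt)
```

Keeping each component as a separate section makes it straightforward to vary one part (for example, the best-practice tips) while holding the rest of the prompt fixed.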