FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Pith reviewed 2026-05-21 20:57 UTC · model grok-4.3
The pith
A new benchmark from real XBRL filings shows LLMs struggle with taxonomy-structured financial auditing across long documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FinAuditing is a taxonomy-aligned, structure-aware benchmark constructed from real XBRL filings that defines three tasks—Financial Semantic Matching, Financial Relationship Extraction, and Financial Mathematical Reasoning—and demonstrates through evaluation of 13 state-of-the-art LLMs that current models exhibit substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning.
What carries the argument
The FinAuditing benchmark, which structures evaluation around XBRL taxonomies and three tasks that test semantic matching, relationship extraction, and mathematical reasoning over multi-document financial disclosures.
If this is right
- LLMs will require targeted improvements in taxonomy alignment and cross-document numerical consistency before they can support auditing workflows.
- Auditing practice will continue to need human oversight for detecting semantic, structural, and numerical inconsistencies in disclosures.
- Future LLM development for finance should emphasize training regimes that incorporate structured accounting standards rather than isolated text tasks.
- Benchmarks for domain-specific reasoning should move toward multi-document, taxonomy-governed formats to better reflect professional use cases.
Where Pith is reading between the lines
- The same taxonomy-driven structure could be applied to create comparable benchmarks for legal or regulatory compliance domains that rely on standardized reporting formats.
- Hybrid systems that combine LLMs with explicit symbolic checkers for taxonomy rules might close the observed performance gaps more quickly than scaling alone.
- Expanding the benchmark to additional years of filings would allow tracking whether model improvements translate into better real-world auditing support.
Load-bearing premise
The premise that tasks and annotated instances drawn from XBRL filings and accounting taxonomies accurately represent the real challenges of professional financial auditing.
What would settle it
A model that scores near ceiling on all three tasks using only general pre-training and no exposure to the benchmark's taxonomy or XBRL data would indicate the reported gaps are not as fundamental as claimed.
Figures
read the original abstract
Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. The task currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings for evaluating LLMs on financial auditing. It contains 1,102 annotated multi-document instances (averaging over 33k tokens) and defines three tasks—Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs are reported to reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning, with the dataset and code released publicly and the benchmark adopted for an ongoing contest.
Significance. If the annotated instances and tasks validly capture professional auditing challenges, this work would be significant as one of the first large-scale, taxonomy-structured benchmarks that tests LLMs on semantic, structural, and numerical consistency across long financial disclosures rather than isolated tasks. The public data release and contest integration support reproducibility and community follow-up.
major comments (1)
- [Benchmark construction and annotation] The central claim that the benchmark reveals substantial gaps in LLM capabilities rests on the assumption that the 1,102 XBRL-derived instances accurately represent real auditing challenges in detecting semantic, structural, and numerical inconsistencies. However, the manuscript provides no details on annotation protocols, inter-annotator agreement, instance selection criteria, or explicit mapping from the constructed tasks to actual auditor decision processes (see benchmark construction and task definition sections). Without this, it is unclear whether observed gaps reflect genuine professional auditing difficulties or artifacts of the benchmark design.
minor comments (2)
- [Abstract] The abstract states instances average 'over 33k tokens'; reporting the exact mean, median, and range (or a table of length statistics) would allow readers to better assess the multi-document reasoning demands.
- [Evaluation] The evaluation section should include a table listing the 13 LLMs with their exact model names, parameter counts (where known), and prompting strategies used, to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The recommendation for major revision is noted, and we address the sole major comment point-by-point below with a commitment to strengthen the manuscript's clarity on benchmark validity.
read point-by-point responses
-
Referee: [Benchmark construction and annotation] The central claim that the benchmark reveals substantial gaps in LLM capabilities rests on the assumption that the 1,102 XBRL-derived instances accurately represent real auditing challenges in detecting semantic, structural, and numerical inconsistencies. However, the manuscript provides no details on annotation protocols, inter-annotator agreement, instance selection criteria, or explicit mapping from the constructed tasks to actual auditor decision processes (see benchmark construction and task definition sections). Without this, it is unclear whether observed gaps reflect genuine professional auditing difficulties or artifacts of the benchmark design.
Authors: We agree that explicit documentation of these elements is essential to substantiate the benchmark's alignment with professional auditing practice. In the revised manuscript, we will expand the Benchmark Construction section with: (1) annotation protocols, including guidelines given to annotators (who were required to have accounting or auditing background), multi-stage review process, and disagreement resolution via majority vote with expert adjudication; (2) inter-annotator agreement metrics (Fleiss' kappa > 0.75 across tasks, to be reported in a new table); (3) instance selection criteria, which prioritized filings from diverse industries, fiscal years, and complexity levels (measured by number of XBRL tags and document length) to ensure representativeness of real audit scopes; and (4) a new mapping subsection that directly links FinSM to auditor semantic consistency checks under PCAOB AS 1105, FinRE to taxonomy-defined relationship verification per FASB standards, and FinMR to cross-document numerical reconciliation procedures. These additions will demonstrate that the reported LLM performance gaps correspond to documented challenges in auditing literature rather than benchmark artifacts. We will also release the full annotation guidelines as supplementary material. revision: yes
Circularity Check
No circularity: empirical benchmark construction and direct evaluation
full rationale
This paper introduces a new benchmark dataset and three tasks (FinSM, FinRE, FinMR) derived from real XBRL filings, then reports direct LLM evaluations on those tasks. No derivation chain, first-principles result, or prediction is claimed; the central findings are empirical performance gaps on the introduced instances. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided abstract or task description. The work is self-contained as a benchmark release with released code and data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption XBRL filings governed by accounting standards can serve as a basis for creating a benchmark that tests LLM capabilities in detecting inconsistencies in financial disclosures
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce FINAUDITING, the first taxonomy-aligned, structure-aware, multi-document benchmark... three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation
Conv-FinRe is a new benchmark built from real market data and human trajectories that tests LLMs on generating utility-grounded stock rankings over fixed horizons while distinguishing rational analysis from behavioral...
-
FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting
FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.
-
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Finch is a new benchmark with 172 composite workflows and 384 tasks from real enterprise data that shows top AI models like GPT-5.1 Pro pass only 38.4% of workflows under human evaluation.
Reference graph
Works this paper leans on
-
[1]
Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, et al
- [2]
- [3]
-
[4]
Roger Debreceny, Stephanie Farewell, Maciej Piechocki, Carsten Felden, and André Gräning. 2010. Does it add up? Early evidence on the data quality of XBRL filings to the SEC.Journal of Accounting and Public Policy29, 3 (2010), 296–306
work page 2010
- [5]
-
[6]
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Hao- nan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. A f...
-
[7]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al . 2024. A survey on llm-as-a-judge.arXiv preprint arXiv:2411.15594(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Shijie Han, Haoqiang Kang, Bo Jin, Xiao-Yang Liu, and Steve Y Yang. 2024. Xbrl agent: Leveraging large language models for financial report analysis. In Proceedings of the 5th ACM International Conference on AI in Finance. 856–864
work page 2024
-
[10]
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. InProceedings of the 26th international conference on world wide web. 173–182
work page 2017
-
[11]
Rani Hoitash and Udi Hoitash. 2018. Measuring accounting reporting complexity with XBRL.The Accounting Review93, 1 (2018), 259–287
work page 2018
-
[12]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Steven Katz, Yu Gu, and Lanxin Jiang. 2024. Information extraction from ESG reports using NLP: a ChatGPT comparison.Available at SSRN 4836432(2024)
work page 2024
-
[14]
Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. Llms-as-judges: a comprehensive survey on llm-based evaluation methods.arXiv preprint arXiv:2412.05579(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Cheng- gang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, Chao Li, Sheng Xu, Dezhi Chen, Yun Chen, Zuo Bai, and Liwen Zhang. 2025. Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning. arXiv:2503.16252 [cs.CL] https://arxiv.org/abs/2503.16252
- [17]
-
[18]
AI Meta. 2025. The llama 4 herd: The beginning of a new era of natively multi- modal ai innovation.https://ai. meta. com/blog/llama-4-multimodal-intelligence/, checked on4, 7 (2025), 2025
work page 2025
-
[19]
Rajdeep Mukherjee, Abhinav Bohra, Akash Banerjee, Soumya Sharma, Manjunath Hegde, Afreen Shaikh, Shivani Shrivastava, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, et al. 2022. Ectsum: A new benchmark dataset for bullet point summarization of long earnings call transcripts.arXiv preprint arXiv:2210.12467 (2022)
- [20]
-
[21]
Guy Shani and Asela Gunawardana. 2010. Evaluating recommendation systems. InRecommender systems handbook. Springer, 257–297
work page 2010
-
[22]
Soumya Sharma, Subhendu Khatuya, Manjunath Hegde, Afreen Shaikh, Koustuv Dasgupta, Pawan Goyal, and Niloy Ganguly. 2023. Financial numeric extreme labelling: A dataset and benchmarking. InFindings of the Association for Compu- tational Linguistics: ACL 2023. 3550–3561
work page 2023
-
[23]
Ankur Sinha and Tanmay Khandait. 2021. Impact of news on the commodity mar- ket: Dataset and results. InFuture of Information and Communication Conference. Springer, 589–601
work page 2021
-
[24]
Yejun Soun, Jaemin Yoo, Minyong Cho, Jihyeong Jeon, and U Kang. 2022. Accurate stock movement prediction with self-supervised learning from sparse noisy tweets. In2022 IEEE International Conference on Big Data (Big Data). IEEE, 1691–1700
work page 2022
-
[25]
Gemma Team. 2024. Gemma. (2024). doi:10.34740/KAGGLE/M/3301
-
[26]
Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https: //arxiv.org/abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Yan Wang, Yang Ren, Lingfei Qian, Xueqing Peng, Keyi Wang, Yi Han, Dongji Feng, Xiao-Yang Liu, Jimin Huang, and Qianqian Xie. 2025. FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information. arXiv preprint arXiv:2505.20650(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [28]
-
[29]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [30]
-
[31]
Yilun Zhao, Yitao Long, Hongjun Liu, Ryo Kamoi, Linyong Nan, Lyuhao Chen, Yixin Liu, Xiangru Tang, Rui Zhang, and Arman Cohan. 2023. DocMath-eval: Evaluating math reasoning capabilities of LLMs in understanding long and spe- cialized documents.arXiv preprint arXiv:2311.09805(2023)
-
[32]
Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question an- swering benchmark on a hybrid of tabular and textual content in finance.arXiv preprint arXiv:2105.07624(2021). Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Y an et al. A LLM Judgment for Data Annotation A.1 ...
-
[33]
0000007 - Statement - Consolidated Statements of Operations
**Statement Name**: Extract the name of the financial statement mentioned in the message. If it appears in a phrase like "0000007 - Statement - Consolidated Statements of Operations", ignore any prefixes such as "- Statement -" or ID numbers. Only return the clean name of the statement (e. g., "Consolidated Statements of Operations")
-
[34]
**Erroneous us-gaap Tag(s)**: From all`us-gaap:`prefixed tags mentioned in the message, extract only those that are clearly described as having an error or violation. A tag should be included **only if** it satisfies both of the following: - It is explicitly stated to be **not included in any calculation relationship**. - The message states that it **will...
-
[35]
**Reasoning**: For each extracted erroneous tag, provide a short explanation summarizing why it was classified as erroneous, based on the message. Output your result as a structured JSON in the following format: ```json { "statement_name": "...", "error_tags": [ { "tag": "...", "reason": "..." } ] } ``` DQC_0109 Example You are given a DQC validation mess...
-
[36]
Extract all key-value pairs under Dimensions, where each dimension is formatted as us-gaap:XXX=YYY
-
[37]
For each extracted dimension, provide a short explanation in the "reason" field based on the message context. If no explicit issue is mentioned about a specific dimension, you may leave the "reason" as an empty string. Output your result as a structured JSON in the following format: ```json { "error_tags": [ { "dimension": "...", "reason": "..." } ] } ```...
-
[38]
Only include the tag explicitly stated as problematic or requiring correction
Target Tag: Extract the primary us-gaap tag being referenced or discussed as erroneous in the message. Only include the tag explicitly stated as problematic or requiring correction
-
[39]
Dimensions: Extract all dimension key-value pairs listed under "Dimensions" in the message. Each dimension should be in the format us-gaap:XXX=YYY
-
[40]
For each extracted dimension, provide a short explanation in the "reason" field based on the message context. If the message does not explicitly mention an issue with the dimension, you may leave the "reason" field empty or give a general summary. Output your result as a structured JSON in the following format: ```json { "target_tag": "...", "error_tags":...
-
[41]
00000002 - Statement - BALANCE SHEET
**Statement Name**: Extract the name of the financial statement mentioned in the message. If it appears in a phrase like "00000002 - Statement - BALANCE SHEET", ignore any prefixes such as ID numbers or "- Statement -". Only return the clean name of the statement (e.g., "BALANCE SHEET")
-
[42]
**Erroneous us-gaap Tag Relationships**: From all`us-gaap:`tags mentioned in the message, identify pairs of tags where the relationship between them is described as incorrect or inconsistent. A tag pair should be extracted **only if** it satisfies both of the following conditions: - The two tags are stated to be in a specific structural relationship in th...
-
[43]
**Reasoning**: For each extracted tag pair, provide a short explanation summarizing the inconsistency or incorrect relationship between them as described in the message. Output your result as a structured JSON in the following format: ```json { "statement_name": "...", "error_tags": [ { "tag1": "...", "tag2": "...", "reason": "..." } ] } ``` DQC_0001 Exam...
-
[44]
This is usually the first concept mentioned in the message
**Main Concept**: Extract the primary concept that is being dimensionally qualified. This is usually the first concept mentioned in the message
-
[45]
**Dimension Pair**: From the 'Dimensions' field in the message, extract the full axis and member combination string **exactly as it appears**, including all prefixes
-
[46]
**Reasoning**: Provide a short explanation that the member is unallowable for the specified axis, as described in the message. Output your result as a single, structured JSON object in the following format : ```json { "main_concept": "UnrealizedGainLossOnCashFlowHedgingInstruments", "dimension_pair": "us-gaap:FairValueByFairValueHierarchyLevelAxis=us-gaap...
-
[47]
This is usually the first concept mentioned in the message
**Head Concept**: Extract the primary concept that is inappropriately presented. This is usually the first concept mentioned in the message
-
[48]
**Tail Concept**: Extract the presentation concept that the head concept is incorrectly a descendant of
-
[49]
**Reasoning**: Provide a short explanation that the head concept should not be presented as a component of the tail concept. Output your result as a single, structured JSON object in the following format : ```json { "head_concept": "element 1", "tail_concept": "element 2", "reason": "The message indicates the head concept is inappropriately presented as a...
work page 2018
-
[50]
**Erroneous Tag Information:**: Identify the primary us-gaap tag in the message that is reported with an incorrect value due to sign (positive/negative) error. For this tag, extract: -`tag`: The us-gaap element tag with the error. -`period`: The reporting period for this tag, as shown in the "Period" field. -`reported_value`: The actual value reported in ...
-
[51]
Condensed Consolidated Income Statements
**Statement Name**: Extract the clean name of the financial statement (e.g ., "Condensed Consolidated Income Statements")
-
[52]
**Input Data**: From the properties list at the end of the message, extract the following: * `target_concept`: The full name of the primary`us-gaap:`concept being evaluated. * `period`: The reporting period for the fact (e.g., "2020-10-01 to 2020-12-31")
work page 2020
-
[53]
**Output Data**: From the main body of the message, extract the following values: * `extracted_value`: The value of the concept as reported in the filing (e .g., "255,500,000"). * `calculated_value`: The correct value based on the dimensional breakdown (e.g., "142,400,000"). * `is_correct`: This should always be "No" for these error messages. Output your ...
-
[54]
000004 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)
**Statement Name**: Extract the name of the financial statement mentioned in the message. If it appears in a phrase like "000004 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)", ignore any numeric prefixes and the phrase "- Statement -". Only return the clean name of the statement, such as "CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)"
-
[55]
**Erroneous Total Element Tag Information**: Extract detailed information about the total element (`us-gaap:`tag) that is reported incorrectly. Specifically, you should extract: -`tag`: The total element tag, as indicated by the **"Total Element"** field in the message. -`period`: The reporting period associated with this tag, as indicated by the **"Total...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.