pith. sign in

arxiv: 2510.08886 · v3 · pith:UKJQJKAJnew · submitted 2025-10-10 · 💻 cs.CL · cs.CE· cs.IR

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Pith reviewed 2026-05-21 20:57 UTC · model grok-4.3

classification 💻 cs.CL cs.CEcs.IR
keywords financial auditingLLM evaluationXBRLtaxonomymulti-document reasoningsemantic matchingrelationship extractionmathematical reasoning
0
0 comments X

The pith

A new benchmark from real XBRL filings shows LLMs struggle with taxonomy-structured financial auditing across long documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FinAuditing, a benchmark built directly from XBRL financial reports to test LLMs on the actual work of professional auditing. It turns auditing into three concrete tasks that require aligning concepts to accounting taxonomies, extracting relations defined by those taxonomies, and performing numerical checks across multiple lengthy filings. The benchmark supplies 1,102 annotated examples, each averaging more than 33,000 tokens, drawn from real disclosures. When 13 current leading models are run on these tasks, they display clear shortfalls in retrieving the right concepts, respecting taxonomy-defined relations, and maintaining consistency from one document to the next. The work therefore argues that simpler text-only tests do not capture the structured demands of financial auditing and that more realistic benchmarks are required.

Core claim

FinAuditing is a taxonomy-aligned, structure-aware benchmark constructed from real XBRL filings that defines three tasks—Financial Semantic Matching, Financial Relationship Extraction, and Financial Mathematical Reasoning—and demonstrates through evaluation of 13 state-of-the-art LLMs that current models exhibit substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning.

What carries the argument

The FinAuditing benchmark, which structures evaluation around XBRL taxonomies and three tasks that test semantic matching, relationship extraction, and mathematical reasoning over multi-document financial disclosures.

If this is right

  • LLMs will require targeted improvements in taxonomy alignment and cross-document numerical consistency before they can support auditing workflows.
  • Auditing practice will continue to need human oversight for detecting semantic, structural, and numerical inconsistencies in disclosures.
  • Future LLM development for finance should emphasize training regimes that incorporate structured accounting standards rather than isolated text tasks.
  • Benchmarks for domain-specific reasoning should move toward multi-document, taxonomy-governed formats to better reflect professional use cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same taxonomy-driven structure could be applied to create comparable benchmarks for legal or regulatory compliance domains that rely on standardized reporting formats.
  • Hybrid systems that combine LLMs with explicit symbolic checkers for taxonomy rules might close the observed performance gaps more quickly than scaling alone.
  • Expanding the benchmark to additional years of filings would allow tracking whether model improvements translate into better real-world auditing support.

Load-bearing premise

The premise that tasks and annotated instances drawn from XBRL filings and accounting taxonomies accurately represent the real challenges of professional financial auditing.

What would settle it

A model that scores near ceiling on all three tasks using only general pre-training and no exposure to the benchmark's taxonomy or XBRL data would indicate the reported gaps are not as fundamental as claimed.

Figures

Figures reproduced from arXiv: 2510.08886 by Fengran Mo, Guojun Xiong, Jaisal Patel, Jeff Zhao, Jian-Yun Nie, Jimin Huang, Keyi Wang, Lingfei Qian, Shanshan Yang, V\'ictor Guti\'errez-Basulto, Xiao-Yang Liu, Xue Liu, Xueqing Peng, Yankai Chen, Yan Wang.

Figure 1
Figure 1. Figure 1: Trends in the proportion of financial restatements [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the FINAUDITING benchmark framework for evaluating error detection on XBRL filings across three tasks. 3.1 Preliminary knowledge The US-GAAP Taxonomy serves as a standardized vocabulary and hierarchical framework that defines both the financial concepts (e.g., us-gaap:CashAndDueFromBanks) and the relationships among them [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: The accuracy (%) under the zero-shot settings on the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The error-rate results (%) for the FinMR task, where [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 3
Figure 3. Figure 3: The F1-score (%) for individual relation type under [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. The task currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings for evaluating LLMs on financial auditing. It contains 1,102 annotated multi-document instances (averaging over 33k tokens) and defines three tasks—Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs are reported to reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning, with the dataset and code released publicly and the benchmark adopted for an ongoing contest.

Significance. If the annotated instances and tasks validly capture professional auditing challenges, this work would be significant as one of the first large-scale, taxonomy-structured benchmarks that tests LLMs on semantic, structural, and numerical consistency across long financial disclosures rather than isolated tasks. The public data release and contest integration support reproducibility and community follow-up.

major comments (1)
  1. [Benchmark construction and annotation] The central claim that the benchmark reveals substantial gaps in LLM capabilities rests on the assumption that the 1,102 XBRL-derived instances accurately represent real auditing challenges in detecting semantic, structural, and numerical inconsistencies. However, the manuscript provides no details on annotation protocols, inter-annotator agreement, instance selection criteria, or explicit mapping from the constructed tasks to actual auditor decision processes (see benchmark construction and task definition sections). Without this, it is unclear whether observed gaps reflect genuine professional auditing difficulties or artifacts of the benchmark design.
minor comments (2)
  1. [Abstract] The abstract states instances average 'over 33k tokens'; reporting the exact mean, median, and range (or a table of length statistics) would allow readers to better assess the multi-document reasoning demands.
  2. [Evaluation] The evaluation section should include a table listing the 13 LLMs with their exact model names, parameter counts (where known), and prompting strategies used, to improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The recommendation for major revision is noted, and we address the sole major comment point-by-point below with a commitment to strengthen the manuscript's clarity on benchmark validity.

read point-by-point responses
  1. Referee: [Benchmark construction and annotation] The central claim that the benchmark reveals substantial gaps in LLM capabilities rests on the assumption that the 1,102 XBRL-derived instances accurately represent real auditing challenges in detecting semantic, structural, and numerical inconsistencies. However, the manuscript provides no details on annotation protocols, inter-annotator agreement, instance selection criteria, or explicit mapping from the constructed tasks to actual auditor decision processes (see benchmark construction and task definition sections). Without this, it is unclear whether observed gaps reflect genuine professional auditing difficulties or artifacts of the benchmark design.

    Authors: We agree that explicit documentation of these elements is essential to substantiate the benchmark's alignment with professional auditing practice. In the revised manuscript, we will expand the Benchmark Construction section with: (1) annotation protocols, including guidelines given to annotators (who were required to have accounting or auditing background), multi-stage review process, and disagreement resolution via majority vote with expert adjudication; (2) inter-annotator agreement metrics (Fleiss' kappa > 0.75 across tasks, to be reported in a new table); (3) instance selection criteria, which prioritized filings from diverse industries, fiscal years, and complexity levels (measured by number of XBRL tags and document length) to ensure representativeness of real audit scopes; and (4) a new mapping subsection that directly links FinSM to auditor semantic consistency checks under PCAOB AS 1105, FinRE to taxonomy-defined relationship verification per FASB standards, and FinMR to cross-document numerical reconciliation procedures. These additions will demonstrate that the reported LLM performance gaps correspond to documented challenges in auditing literature rather than benchmark artifacts. We will also release the full annotation guidelines as supplementary material. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and direct evaluation

full rationale

This paper introduces a new benchmark dataset and three tasks (FinSM, FinRE, FinMR) derived from real XBRL filings, then reports direct LLM evaluations on those tasks. No derivation chain, first-principles result, or prediction is claimed; the central findings are empirical performance gaps on the introduced instances. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided abstract or task description. The work is self-contained as a benchmark release with released code and data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on domain assumptions about financial reporting standards and standard practices for creating NLP benchmarks and evaluating LLMs.

axioms (1)
  • domain assumption XBRL filings governed by accounting standards can serve as a basis for creating a benchmark that tests LLM capabilities in detecting inconsistencies in financial disclosures
    The paper assumes the structured nature of XBRL enables the definition of tasks like semantic matching and relationship extraction.

pith-pipeline@v0.9.0 · 5825 in / 1360 out tokens · 80948 ms · 2026-05-21T20:57:26.985031+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

    cs.AI 2026-02 unverdicted novelty 7.0

    Conv-FinRe is a new benchmark built from real market data and human trajectories that tests LLMs on generating utility-grounded stock rankings over fixed horizons while distinguishing rational analysis from behavioral...

  2. FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting

    cs.CL 2026-02 unverdicted novelty 6.0

    FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.

  3. Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

    cs.AI 2025-12 unverdicted novelty 6.0

    Finch is a new benchmark with 172 composite workflows and 384 tasks from real enterprise data that shows top AI models like GPT-5.1 Pro pass only 38.4% of workflows under human evaluation.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 3 Pith papers · 8 internal anchors

  1. [1]

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, et al

  2. [2]

    Finqa: A dataset of numerical reasoning over financial data.arXiv preprint arXiv:2109.00122(2021)

  3. [3]

    Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. Convfinqa: Exploring the chain of numerical reasoning in conversational finance question answering.arXiv preprint arXiv:2210.03849 (2022)

  4. [4]

    Roger Debreceny, Stephanie Farewell, Maciej Piechocki, Carsten Felden, and André Gräning. 2010. Does it add up? Early evidence on the data quality of XBRL filings to the SEC.Journal of Accounting and Public Policy29, 3 (2010), 296–306

  5. [5]

    Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Zhengyu Chen, Alejandro Lopez-Lira, and Hao Wang. 2023. Empowering many, biasing a few: Generalist credit scoring through large language models. arXiv preprint arXiv:2310.00566(2023)

  6. [6]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Hao- nan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. A f...

  7. [7]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783 (2024)

  8. [8]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al . 2024. A survey on llm-as-a-judge.arXiv preprint arXiv:2411.15594(2024)

  9. [9]

    Shijie Han, Haoqiang Kang, Bo Jin, Xiao-Yang Liu, and Steve Y Yang. 2024. Xbrl agent: Leveraging large language models for financial report analysis. In Proceedings of the 5th ACM International Conference on AI in Finance. 856–864

  10. [10]

    Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. InProceedings of the 26th international conference on world wide web. 173–182

  11. [11]

    Rani Hoitash and Udi Hoitash. 2018. Measuring accounting reporting complexity with XBRL.The Accounting Review93, 1 (2018), 259–287

  12. [12]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)

  13. [13]

    Steven Katz, Yu Gu, and Lanxin Jiang. 2024. Information extraction from ESG reports using NLP: a ChatGPT comparison.Available at SSRN 4836432(2024)

  14. [14]

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. Llms-as-judges: a comprehensive survey on llm-based evaluation methods.arXiv preprint arXiv:2412.05579(2024)

  15. [15]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Cheng- gang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437(2024)

  16. [16]

    Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, Chao Li, Sheng Xu, Dezhi Chen, Yun Chen, Zuo Bai, and Liwen Zhang. 2025. Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning. arXiv:2503.16252 [cs.CL] https://arxiv.org/abs/2503.16252

  17. [17]

    Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Pro- dromos Malakasiotis, Ion Androutsopoulos, and Georgios Paliouras. 2022. FiNER: Financial numeric entity recognition for XBRL tagging.arXiv preprint arXiv:2203.06482(2022)

  18. [18]

    AI Meta. 2025. The llama 4 herd: The beginning of a new era of natively multi- modal ai innovation.https://ai. meta. com/blog/llama-4-multimodal-intelligence/, checked on4, 7 (2025), 2025

  19. [19]

    Rajdeep Mukherjee, Abhinav Bohra, Akash Banerjee, Soumya Sharma, Manjunath Hegde, Afreen Shaikh, Shivani Shrivastava, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, et al. 2022. Ectsum: A new benchmark dataset for bullet point summarization of long earnings call transcripts.arXiv preprint arXiv:2210.12467 (2022)

  20. [20]

    Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Jimin Huang, and Qian- qian Xie. 2025. Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance.arXiv preprint arXiv:2502.08127(2025)

  21. [21]

    Guy Shani and Asela Gunawardana. 2010. Evaluating recommendation systems. InRecommender systems handbook. Springer, 257–297

  22. [22]

    Soumya Sharma, Subhendu Khatuya, Manjunath Hegde, Afreen Shaikh, Koustuv Dasgupta, Pawan Goyal, and Niloy Ganguly. 2023. Financial numeric extreme labelling: A dataset and benchmarking. InFindings of the Association for Compu- tational Linguistics: ACL 2023. 3550–3561

  23. [23]

    Ankur Sinha and Tanmay Khandait. 2021. Impact of news on the commodity mar- ket: Dataset and results. InFuture of Information and Communication Conference. Springer, 589–601

  24. [24]

    Yejun Soun, Jaemin Yoo, Minyong Cho, Jihyeong Jeon, and U Kang. 2022. Accurate stock movement prediction with self-supervised learning from sparse noisy tweets. In2022 IEEE International Conference on Big Data (Big Data). IEEE, 1691–1700

  25. [25]

    Gemma Team. 2024. Gemma. (2024). doi:10.34740/KAGGLE/M/3301

  26. [26]

    Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https: //arxiv.org/abs/2505.09388

  27. [27]

    Yan Wang, Yang Ren, Lingfei Qian, Xueqing Peng, Keyi Wang, Yi Han, Dongji Feng, Xiao-Yang Liu, Jimin Huang, and Qianqian Xie. 2025. FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information. arXiv preprint arXiv:2505.20650(2025)

  28. [28]

    Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. 2023. Pixiu: A large language model, instruction data and evaluation benchmark for finance.arXiv preprint arXiv:2306.05443 (2023)

  29. [29]

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

  30. [30]

    Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang. 2022. MultiHiertt: Nu- merical reasoning over multi hierarchical tabular and textual data.arXiv preprint arXiv:2206.01347(2022)

  31. [31]

    Yilun Zhao, Yitao Long, Hongjun Liu, Ryo Kamoi, Linyong Nan, Lyuhao Chen, Yixin Liu, Xiangru Tang, Rui Zhang, and Arman Cohan. 2023. DocMath-eval: Evaluating math reasoning capabilities of LLMs in understanding long and spe- cialized documents.arXiv preprint arXiv:2311.09805(2023)

  32. [32]

    Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question an- swering benchmark on a hybrid of tabular and textual content in finance.arXiv preprint arXiv:2105.07624(2021). Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Y an et al. A LLM Judgment for Data Annotation A.1 ...

  33. [33]

    0000007 - Statement - Consolidated Statements of Operations

    **Statement Name**: Extract the name of the financial statement mentioned in the message. If it appears in a phrase like "0000007 - Statement - Consolidated Statements of Operations", ignore any prefixes such as "- Statement -" or ID numbers. Only return the clean name of the statement (e. g., "Consolidated Statements of Operations")

  34. [34]

    A tag should be included **only if** it satisfies both of the following: - It is explicitly stated to be **not included in any calculation relationship**

    **Erroneous us-gaap Tag(s)**: From all`us-gaap:`prefixed tags mentioned in the message, extract only those that are clearly described as having an error or violation. A tag should be included **only if** it satisfies both of the following: - It is explicitly stated to be **not included in any calculation relationship**. - The message states that it **will...

  35. [35]

    statement_name

    **Reasoning**: For each extracted erroneous tag, provide a short explanation summarizing why it was classified as erroneous, based on the message. Output your result as a structured JSON in the following format: ```json { "statement_name": "...", "error_tags": [ { "tag": "...", "reason": "..." } ] } ``` DQC_0109 Example You are given a DQC validation mess...

  36. [36]

    Extract all key-value pairs under Dimensions, where each dimension is formatted as us-gaap:XXX=YYY

  37. [37]

    reason" field based on the message context. If no explicit issue is mentioned about a specific dimension, you may leave the

    For each extracted dimension, provide a short explanation in the "reason" field based on the message context. If no explicit issue is mentioned about a specific dimension, you may leave the "reason" as an empty string. Output your result as a structured JSON in the following format: ```json { "error_tags": [ { "dimension": "...", "reason": "..." } ] } ```...

  38. [38]

    Only include the tag explicitly stated as problematic or requiring correction

    Target Tag: Extract the primary us-gaap tag being referenced or discussed as erroneous in the message. Only include the tag explicitly stated as problematic or requiring correction

  39. [39]

    Dimensions

    Dimensions: Extract all dimension key-value pairs listed under "Dimensions" in the message. Each dimension should be in the format us-gaap:XXX=YYY

  40. [40]

    reason" field based on the message context. If the message does not explicitly mention an issue with the dimension, you may leave the

    For each extracted dimension, provide a short explanation in the "reason" field based on the message context. If the message does not explicitly mention an issue with the dimension, you may leave the "reason" field empty or give a general summary. Output your result as a structured JSON in the following format: ```json { "target_tag": "...", "error_tags":...

  41. [41]

    00000002 - Statement - BALANCE SHEET

    **Statement Name**: Extract the name of the financial statement mentioned in the message. If it appears in a phrase like "00000002 - Statement - BALANCE SHEET", ignore any prefixes such as ID numbers or "- Statement -". Only return the clean name of the statement (e.g., "BALANCE SHEET")

  42. [42]

    **Erroneous us-gaap Tag Relationships**: From all`us-gaap:`tags mentioned in the message, identify pairs of tags where the relationship between them is described as incorrect or inconsistent. A tag pair should be extracted **only if** it satisfies both of the following conditions: - The two tags are stated to be in a specific structural relationship in th...

  43. [43]

    statement_name

    **Reasoning**: For each extracted tag pair, provide a short explanation summarizing the inconsistency or incorrect relationship between them as described in the message. Output your result as a structured JSON in the following format: ```json { "statement_name": "...", "error_tags": [ { "tag1": "...", "tag2": "...", "reason": "..." } ] } ``` DQC_0001 Exam...

  44. [44]

    This is usually the first concept mentioned in the message

    **Main Concept**: Extract the primary concept that is being dimensionally qualified. This is usually the first concept mentioned in the message

  45. [45]

    **Dimension Pair**: From the 'Dimensions' field in the message, extract the full axis and member combination string **exactly as it appears**, including all prefixes

  46. [46]

    main_concept

    **Reasoning**: Provide a short explanation that the member is unallowable for the specified axis, as described in the message. Output your result as a single, structured JSON object in the following format : ```json { "main_concept": "UnrealizedGainLossOnCashFlowHedgingInstruments", "dimension_pair": "us-gaap:FairValueByFairValueHierarchyLevelAxis=us-gaap...

  47. [47]

    This is usually the first concept mentioned in the message

    **Head Concept**: Extract the primary concept that is inappropriately presented. This is usually the first concept mentioned in the message

  48. [48]

    **Tail Concept**: Extract the presentation concept that the head concept is incorrectly a descendant of

  49. [49]

    head_concept

    **Reasoning**: Provide a short explanation that the head concept should not be presented as a component of the tail concept. Output your result as a single, structured JSON object in the following format : ```json { "head_concept": "element 1", "tail_concept": "element 2", "reason": "The message indicates the head concept is inappropriately presented as a...

  50. [50]

    Period" field. -`reported_value`: The actual value reported in the filing, as stated in the message or

    **Erroneous Tag Information:**: Identify the primary us-gaap tag in the message that is reported with an incorrect value due to sign (positive/negative) error. For this tag, extract: -`tag`: The us-gaap element tag with the error. -`period`: The reporting period for this tag, as shown in the "Period" field. -`reported_value`: The actual value reported in ...

  51. [51]

    Condensed Consolidated Income Statements

    **Statement Name**: Extract the clean name of the financial statement (e.g ., "Condensed Consolidated Income Statements")

  52. [52]

    2020-10-01 to 2020-12-31

    **Input Data**: From the properties list at the end of the message, extract the following: * `target_concept`: The full name of the primary`us-gaap:`concept being evaluated. * `period`: The reporting period for the fact (e.g., "2020-10-01 to 2020-12-31")

  53. [53]

    255,500,000

    **Output Data**: From the main body of the message, extract the following values: * `extracted_value`: The value of the concept as reported in the filing (e .g., "255,500,000"). * `calculated_value`: The correct value based on the dimensional breakdown (e.g., "142,400,000"). * `is_correct`: This should always be "No" for these error messages. Output your ...

  54. [54]

    000004 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)

    **Statement Name**: Extract the name of the financial statement mentioned in the message. If it appears in a phrase like "000004 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)", ignore any numeric prefixes and the phrase "- Statement -". Only return the clean name of the statement, such as "CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)"

  55. [55]

    Total Element

    **Erroneous Total Element Tag Information**: Extract detailed information about the total element (`us-gaap:`tag) that is reported incorrectly. Specifically, you should extract: -`tag`: The total element tag, as indicated by the **"Total Element"** field in the message. -`period`: The reporting period associated with this tag, as indicated by the **"Total...