arxiv: 2605.17962 · v1 · pith:CRTMFMMVnew · submitted 2026-05-18 · 💻 cs.CE

FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation

Pith reviewed 2026-05-20 00:25 UTC · model grok-4.3

classification 💻 cs.CE
keywords financial multimodal reasoning · document-level benchmark · large multimodal models · financial reports · visual grounding · numerical estimation · cross-page reasoning

The pith

A new benchmark shows that none of eleven large multimodal models exceeds an overall score of 65 on document-level financial reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates FinDocMRE, a benchmark of 12,207 samples drawn from 2,878 real financial reports across twelve domains. It evaluates how well multimodal models integrate text, tables, and images for five types of reasoning tasks at the full document level. Tests on eleven representative models find none reach an overall score above 65, with clear drops on numerical estimation and cross-page visual grounding. This gap matters because financial work routinely requires connecting scattered data inside lengthy reports rather than isolated charts. The benchmark is built to drive progress toward models that can perform expert-level document analysis.

Core claim

The central claim is that FinDocMRE, constructed through a semi-automated pipeline of Visual-Centric Generation followed by Expert Verification, provides a high-quality test set of 12,207 samples that reveals fundamental limits in current large multimodal models. When eleven models are evaluated, none surpass an overall score of 65; models handle semantic narrative construction more readily but consistently underperform on numerical estimation and cross-page visual grounding within complex multi-image financial documents.

What carries the argument

The FinDocMRE benchmark, which supplies multi-image document-level tasks spanning five reasoning types drawn from real financial reports.

If this is right

  • Models must improve simultaneous visual grounding and logical reasoning across multiple pages of a single document.
  • Targeted gains are needed in numerical estimation tasks that combine table data with surrounding text and figures.
  • The benchmark can serve as a development target for specialized financial multimodal systems.
  • Performance differences across the five task types indicate that uniform training approaches leave specific weaknesses unaddressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar document-level benchmarks may be required for other high-stakes domains that mix text, tables, and images.
  • Training regimes that explicitly reward cross-page consistency could raise scores without changing model scale.
  • The observed ceiling suggests that current architectures may need new mechanisms for maintaining context across long financial reports.

Load-bearing premise

The semi-automated pipeline of visual-centric generation plus expert verification actually removes text bias and delivers annotation quality high enough to support strong conclusions about model limitations.

What would settle it

A new model that scores above 65 overall on the full FinDocMRE test set while closing the gaps on numerical estimation and cross-page grounding tasks would falsify the reported performance ceiling.

Figures

Figures reproduced from arXiv: 2605.17962 by Dawei Cheng, Feng Yu, Jiangtong Li, Jiayong Zhu, Jie Xu, Jinru Ding.

Figure 1. The overall taxonomy and statistics of the …
Figure 2. The annotation pipeline of the FINDOCMRE benchmark.
Figure 3. Dataset statistics overview: (a) Reasoning types, …
Figure 4. Impact of visual context on objective accuracy. While all models benefit from Cropped Images, Bounding Box annotations degrade the performance of advanced models, due to occlusion or visual noise.
Figure 5. Comparison of GPT-5 and Qwen3-VL-30B on a cross-page calculation task.
Figure 6. Impact of PDF Length.
Figure 7. Impact of Image Resolution (DPI). Advanced models (e.g., Doubao, GPT-5) show minimal gains beyond 110 DPI, supporting the experimental setting. Other models (e.g., Qwen3-Max) remain limited by reasoning capabilities regardless of resolution.
Figure 8. Comparison of GPT-5 and Qwen3-VL-30B on Multiple Choice task.
Figure 9. Comparison of GPT-5 and Qwen3-VL-30B on Open Ended task.
Original abstract

While Large Multimodal Models (LMMs) excel in general visual tasks, their deployment in specialized financial contexts remains insufficient. Existing benchmarks prioritize isolated charts, often overlooking the need to integrate data from text, tables, and images within comprehensive financial documents. To address this limitation, we introduce FINDOCMRE, a multi-image document-level benchmark designed for financial multimodal reasoning. We construct the dataset via a semi-automated pipeline that combines Visual-Centric Generation with Expert Verification, thereby minimizing text bias and ensuring high annotation quality. Spanning twelve domains, the benchmark comprises 12,207 samples derived from 2,878 financial reports, designed to evaluate multi-image processing and document-level understanding across five distinct task types. Extensive experiments with eleven representative LMMs reveal that no model surpasses an overall score of 65, highlighting challenges in integrating visual grounding with logical reasoning within complex document environments. Specifically, we observe a significant performance divergence across tasks, where models exhibit proficiency in semantic narrative construction but struggle with numerical estimation and cross-page visual grounding. FINDOCMRE serves as a rigorous benchmark to guide the evolution of financial LMMs towards expert-level document analysis and reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces FinDocMRE, a multi-image document-level benchmark for financial multimodal reasoning. It is constructed via a semi-automated pipeline of Visual-Centric Generation combined with Expert Verification, yielding 12,207 samples from 2,878 financial reports across twelve domains and five task types. Experiments evaluate eleven representative LMMs and report that no model exceeds an overall score of 65, with models showing relative strength in semantic narrative tasks but weakness in numerical estimation and cross-page visual grounding.

Significance. If the dataset annotations are shown to be reliable and free of systematic artifacts, the benchmark would meaningfully extend existing evaluations by targeting integrated multimodal reasoning over full financial documents rather than isolated charts or text. The reported performance ceiling and task-specific gaps could then usefully guide development of LMMs for domain-specific document analysis.

major comments (1)
  1. [Abstract and Dataset Construction] The claim that the semi-automated pipeline 'minimizes text bias and ensures high annotation quality' is presented without supporting quantitative evidence such as inter-annotator agreement, expert correction rates, or rejection statistics from the 2,878 reports. Because the central result (no LMM exceeds 65 overall, with specific struggles in numerical and cross-page tasks) rests on the assumption that the 12,207 samples validly test visual grounding and logical reasoning, the absence of these metrics leaves open the possibility that annotation artifacts or leakage depress scores and exaggerate the reported challenges.
minor comments (1)
  1. [Abstract] Specify the exact meaning and scale of the 'overall score of 65' (e.g., percentage, normalized accuracy) to prevent reader ambiguity.
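
To illustrate why the minor comment matters, the sketch below shows how the same per-task results can yield different "overall" numbers depending on the aggregation rule. The task names and counts are hypothetical and are not taken from the paper; the paper's actual scoring formula is not given in the text reviewed here.

```python
# Minimal sketch: two common ways to aggregate per-task results into one score.
# All task names and counts below are hypothetical, for illustration only.

def micro_average(per_task):
    """Pool all samples: total correct / total samples, on a 0-100 scale."""
    total_correct = sum(correct for correct, _ in per_task.values())
    total_samples = sum(count for _, count in per_task.values())
    return 100.0 * total_correct / total_samples


def macro_average(per_task):
    """Average the per-task accuracies so that every task weighs the same."""
    accuracies = [correct / count for correct, count in per_task.values()]
    return 100.0 * sum(accuracies) / len(accuracies)


# (correct, total) per hypothetical task
example = {"task_a": (90, 100), "task_b": (10, 50)}

print(round(micro_average(example), 2))  # pooled over 150 samples
print(round(macro_average(example), 2))  # mean of 0.90 and 0.20
```

With unbalanced task sizes the two rules diverge, which is why a bare "65" is hard to interpret without the definition.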

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments. We address the major comment below and will strengthen the manuscript accordingly.

point-by-point responses
  1. Referee: The claim that the semi-automated pipeline 'minimizes text bias and ensures high annotation quality' is presented without supporting quantitative evidence such as inter-annotator agreement, expert correction rates, or rejection statistics from the 2,878 reports. Because the central result (no LMM exceeds 65 overall, with specific struggles in numerical and cross-page tasks) rests on the assumption that the 12,207 samples validly test visual grounding and logical reasoning, the absence of these metrics leaves open the possibility that annotation artifacts or leakage depress scores and exaggerate the reported challenges.

    Authors: We agree that the current manuscript would benefit from quantitative evidence on the verification stage. In the revised version we will add a dedicated subsection to the Dataset Construction section that reports the available statistics from the expert verification process, including inter-annotator agreement on a sampled subset, the fraction of generated samples that required expert corrections, and the rejection rate among the initial 2,878 reports. These additions will directly support the claims of annotation quality and allow readers to assess the risk of artifacts or leakage.

    Revision: yes
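
The statistics named in this exchange are simple to compute once verification labels exist. The following is a minimal sketch, assuming two annotators each assign one categorical verdict per sample; Cohen's kappa is used here as one common agreement measure, not necessarily the one the authors will report. The labels (`ok`, `fix`, `reject`) and function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def fraction_with_label(labels, target):
    """Share of items carrying a given verdict, e.g. a correction rate."""
    return sum(label == target for label in labels) / len(labels)


# Hypothetical verdicts on six sampled items.
annotator_1 = ["ok", "ok", "fix", "ok", "reject", "fix"]
annotator_2 = ["ok", "fix", "fix", "ok", "reject", "ok"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))
print(round(fraction_with_label(annotator_1, "fix"), 3))
```

Reporting kappa alongside raw agreement guards against high agreement that arises only because one verdict dominates.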

Circularity Check

0 steps flagged

No circularity: benchmark construction and direct model evaluation are self-contained

The paper presents a new dataset and reports empirical scores from testing eleven LMMs on it. No mathematical derivations, fitted parameters, or predictions appear in the provided text; the overall score ceiling of 65 is a direct measurement on the 12,207 samples rather than a quantity that reduces to prior inputs by construction. Dataset creation via the described semi-automated pipeline is a methodological step, not a derived claim justified by self-citation or ansatz. No load-bearing uniqueness theorems or renamings of known results are invoked. The evaluation therefore stands as an independent empirical observation against external models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Benchmark papers rest on assumptions about data representativeness and annotation quality rather than mathematical axioms or new entities.

axioms (1)
  • domain assumption: Expert verification after visual-centric generation produces high-quality annotations free of significant text bias.
    Invoked in the dataset construction description to justify the 12,207 samples.

pith-pipeline@v0.9.0 · 5750 in / 1217 out tokens · 38510 ms · 2026-05-20T00:25:05.424720+00:00 · methodology
