FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation

Dawei Cheng; Feng Yu; Jiangtong Li; Jiayong Zhu; Jie Xu; Jinru Ding

arxiv: 2605.17962 · v1 · pith:CRTMFMMVnew · submitted 2026-05-18 · 💻 cs.CE

Jiayong Zhu, Jiangtong Li, Jinru Ding, Dawei Cheng, Jie Xu, Feng Yu

Pith reviewed 2026-05-20 00:25 UTC · model grok-4.3

classification 💻 cs.CE

keywords financial multimodal reasoning · document-level benchmark · large multimodal models · financial reports · visual grounding · numerical estimation · cross-page reasoning

The pith

A new benchmark shows that no large multimodal model scores above 65 overall on document-level financial reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates FinDocMRE, a benchmark of 12,207 samples drawn from 2,878 real financial reports across twelve domains. It evaluates how well multimodal models integrate text, tables, and images across five task types at the full document level. Tests on eleven representative models find that none reaches an overall score above 65, with clear drops on numerical estimation and cross-page visual grounding. This gap matters because financial work routinely requires connecting scattered data inside lengthy reports rather than reading isolated charts. The benchmark is built to drive progress toward models that can perform expert-level document analysis.

Core claim

The central claim is that FinDocMRE, constructed through a semi-automated pipeline of Visual-Centric Generation followed by Expert Verification, provides a high-quality test set of 12,207 samples that reveals fundamental limits in current large multimodal models. When eleven models are evaluated, none surpass an overall score of 65; models handle semantic narrative construction more readily but consistently underperform on numerical estimation and cross-page visual grounding within complex multi-image financial documents.

What carries the argument

The FinDocMRE benchmark, which supplies multi-image, document-level tasks spanning five task types drawn from real financial reports.

If this is right

Models must improve simultaneous visual grounding and logical reasoning across multiple pages of a single document.
Targeted gains are needed in numerical estimation tasks that combine table data with surrounding text and figures.
The benchmark can serve as a development target for specialized financial multimodal systems.
Performance differences across the five task types indicate that uniform training approaches leave specific weaknesses unaddressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar document-level benchmarks may be required for other high-stakes domains that mix text, tables, and images.
Training regimes that explicitly reward cross-page consistency could raise scores without changing model scale.
The observed ceiling suggests that current architectures may need new mechanisms for maintaining context across long financial reports.

Load-bearing premise

The semi-automated pipeline of visual-centric generation plus expert verification actually removes text bias and delivers annotation quality high enough to support strong conclusions about model limitations.

What would settle it

A new model that scores above 65 overall on the full FinDocMRE test set while closing the gaps on numerical estimation and cross-page grounding tasks would falsify the reported performance ceiling.

Figures

Figures reproduced from arXiv: 2605.17962 by Dawei Cheng, Feng Yu, Jiangtong Li, Jiayong Zhu, Jie Xu, Jinru Ding.

**Figure 2.** The annotation pipeline of the FINDOCMRE benchmark.

**Figure 3.** Dataset statistics overview: (a) Reasoning types, ...

**Figure 4.** Impact of visual context on objective accuracy. While all models benefit from Cropped Images, Bounding Box annotations degrade the performance of advanced models, due to occlusion or visual noise.

**Figure 5.** Comparison of GPT-5 and Qwen3-VL-30B on a cross-page calculation task.

**Figure 6.** Impact of PDF Length.

**Figure 7.** Impact of Image Resolution (DPI). Advanced models (e.g., Doubao, GPT-5) show minimal gains beyond 110 DPI, supporting the experimental setting. Other models (e.g., Qwen3-Max) remain limited by reasoning capabilities regardless of resolution.

**Figure 8.** Comparison of GPT-5 and Qwen3-VL-30B on a Multiple Choice task.

**Figure 9.** Comparison of GPT-5 and Qwen3-VL-30B on an Open Ended task.

Original abstract

While Large Multimodal Models (LMMs) excel in general visual tasks, their deployment in specialized financial contexts remains insufficient. Existing benchmarks prioritize isolated charts, often overlooking the need to integrate data from text, tables, and images within comprehensive financial documents. To address this limitation, we introduce FINDOCMRE, a multi-image document-level benchmark designed for financial multimodal reasoning. We construct the dataset via a semi-automated pipeline that combines Visual-Centric Generation with Expert Verification, thereby minimizing text bias and ensuring high annotation quality. Spanning twelve domains, the benchmark comprises 12,207 samples derived from 2,878 financial reports, designed to evaluate multi-image processing and document-level understanding across five distinct task types. Extensive experiments with eleven representative LMMs reveal that no model surpasses an overall score of 65, highlighting challenges in integrating visual grounding with logical reasoning within complex document environments. Specifically, we observe a significant performance divergence across tasks, where models exhibit proficiency in semantic narrative construction but struggle with numerical estimation and cross-page visual grounding. FINDOCMRE serves as a rigorous benchmark to guide the evolution of financial LMMs towards expert-level document analysis and reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FinDocMRE adds a new benchmark focused on multi-image financial document reasoning, but the lack of annotation quality metrics leaves the performance claims hard to evaluate.

The main point is that this paper builds a benchmark for testing multimodal models on full financial reports that mix text, tables, and images across pages. The authors collected 12,207 samples from 2,878 reports in 12 domains and defined five task types that include narrative construction, numerical estimation, and cross-page grounding. When they tested eleven LMMs, none reached an overall score above 65, with clear drops on the numerical estimation and cross-page grounding tasks.

Referee Report

1 major / 1 minor

Summary. The paper introduces FinDocMRE, a multi-image document-level benchmark for financial multimodal reasoning. It is constructed via a semi-automated pipeline of Visual-Centric Generation combined with Expert Verification, yielding 12,207 samples from 2,878 financial reports across twelve domains and five task types. Experiments evaluate eleven representative LMMs and report that no model exceeds an overall score of 65, with models showing relative strength in semantic narrative tasks but weakness in numerical estimation and cross-page visual grounding.

Significance. If the dataset annotations are shown to be reliable and free of systematic artifacts, the benchmark would meaningfully extend existing evaluations by targeting integrated multimodal reasoning over full financial documents rather than isolated charts or text. The reported performance ceiling and task-specific gaps could then usefully guide development of LMMs for domain-specific document analysis.

major comments (1)

[Abstract and Dataset Construction] The claim that the semi-automated pipeline 'minimizes text bias and ensures high annotation quality' is presented without supporting quantitative evidence such as inter-annotator agreement, expert correction rates, or rejection statistics from the 2,878 reports. Because the central result (no LMM exceeds 65 overall, with specific struggles in numerical and cross-page tasks) rests on the assumption that the 12,207 samples validly test visual grounding and logical reasoning, the absence of these metrics leaves open the possibility that annotation artifacts or leakage depress scores and exaggerate the reported challenges.

minor comments (1)

[Abstract] Specify the exact meaning and scale of the 'overall score of 65' (e.g., percentage, normalized accuracy) to prevent reader ambiguity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments. We address the major comment below and will strengthen the manuscript accordingly.

Referee: The claim that the semi-automated pipeline 'minimizes text bias and ensures high annotation quality' is presented without supporting quantitative evidence such as inter-annotator agreement, expert correction rates, or rejection statistics from the 2,878 reports. Because the central result (no LMM exceeds 65 overall, with specific struggles in numerical and cross-page tasks) rests on the assumption that the 12,207 samples validly test visual grounding and logical reasoning, the absence of these metrics leaves open the possibility that annotation artifacts or leakage depress scores and exaggerate the reported challenges.

Authors: We agree that the current manuscript would benefit from quantitative evidence on the verification stage. In the revised version we will add a dedicated subsection to the Dataset Construction section that reports the available statistics from the expert verification process, including inter-annotator agreement on a sampled subset, the fraction of generated samples that required expert corrections, and the rejection rate among the initial 2,878 reports. These additions will directly support the claims of annotation quality and allow readers to assess the risk of artifacts or leakage.

revision: yes
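The rebuttal does not say which agreement statistic would be reported. As one illustration only, the sketch below computes Cohen's kappa for two annotators who each mark the same sampled items as accepted or rejected. The function name `cohens_kappa` and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verification decisions on six sampled items.
annotator_1 = ["accept", "accept", "reject", "accept", "reject", "accept"]
annotator_2 = ["accept", "reject", "reject", "accept", "reject", "accept"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most generated samples are accepted and raw agreement would look high regardless of annotator care.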

Circularity Check

0 steps flagged

No circularity: benchmark construction and direct model evaluation are self-contained

The paper presents a new dataset and reports empirical scores from testing eleven LMMs on it. No mathematical derivations, fitted parameters, or predictions appear in the provided text; the overall score ceiling of 65 is a direct measurement on the 12,207 samples rather than a quantity that reduces to prior inputs by construction. Dataset creation via the described semi-automated pipeline is a methodological step, not a derived claim justified by self-citation or ansatz. No load-bearing uniqueness theorems or renamings of known results are invoked. The evaluation therefore stands as an independent empirical observation against external models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Benchmark papers rest on assumptions about data representativeness and annotation quality rather than mathematical axioms or new entities.

axioms (1)

domain assumption: Expert verification after visual-centric generation produces annotations free of significant text bias and of high quality.
Invoked in the dataset construction description to justify the 12,207 samples.

pith-pipeline@v0.9.0 · 5750 in / 1217 out tokens · 38510 ms · 2026-05-20T00:25:05.424720+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We construct the dataset via a semi-automated pipeline that combines Visual-Centric Generation with Expert Verification"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "no model surpasses an overall score of 65"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 12 internal anchors

[1] Comanici, G., E. Bieber, M. Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[2] Bai, S., K. Chen, X. Liu, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[3] Chen, Z., W. Chen, C. Smiley, et al. Finqa: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711. 2021.
[4] Ding, J., C. Ding, W. Pang, et al. Cnfinbench: A benchmark for safety and compliance of large language models in finance. arXiv preprint arXiv:2512.09506, 2025.
[5] Xue, S., X. Li, F. Zhou, et al. Famma: A benchmark for financial domain multilingual multimodal question answering. arXiv preprint arXiv:2410.04526, 2024.
[6] Karamcheti, S., S. Nair, A. Balakrishna, et al. Prismatic vlms: Investigating the design space of visually-conditioned language models. In Forty-first International Conference on Machine Learning. 2024.
[7] Huang, W., H. Liu, M. Guo, et al. Visual hallucinations of multi-modal large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9614–9631. 2024.
[8] Achiam, J., S. Adler, S. Agarwal, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[9] Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. Anthropic Model Card.
[10] Zeng, A., X. Lv, Q. Zheng, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025.
[11] Guo, D., F. Wu, F. Zhu, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025.
[12] Liu, H., C. Li, Q. Wu, et al. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.
[13] Zhu, J., W. Wang, Z. Chen, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
[14] Fu, C., P. Chen, Y. Shen, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[15] Wang, K., J. Pan, W. Shi, et al. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024.
[16] Zhang, R., D. Jiang, Y. Zhang, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024.
[17] Yue, X., Y. Ni, K. Zhang, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567. 2024.
[18] Lu, P., H. Bansal, T. Xia, et al. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
[19] Masry, A., X. L. Do, J. Q. Tan, et al. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pages 2263–2279. 2022.
[20] Chen, L., J. Li, X. Dong, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024.
[21] Liu, Y., H. Duan, Y. Zhang, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024.
[22] Xu, Z., S. Du, Y. Qi, et al. Chartbench: A benchmark for complex visual reasoning in charts. arXiv preprint arXiv:2312.15915, 2023.
[23] Liu, F., X. Wang, W. Yao, et al. Mmc: Advancing multimodal chart understanding with large-scale instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1287–1310. 2024.
[24] Mathew, M., D. Karatzas, C. Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209. 2021.
[25] Shah, R. S., K. Chawla, D. Eidnani, et al. When flue meets flang: Benchmarks and large pre-trained language model for financial domain. In EMNLP. 2022.
[26] Lu, D., H. Wu, J. Liang, et al. Bbt-fin: Comprehensive construction of chinese financial domain pre-trained language model, corpus and benchmark. arXiv preprint arXiv:2302.09432, 2023.
[27] Li, H., Y. Cao, Y. Yu, et al. Investorbench: A benchmark for financial decision-making tasks with llm-based agent. arXiv preprint arXiv:2412.18174, 2024.
[28] Islam, P., A. Kannappan, D. Kiela, et al. Financebench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023.
[29] Zhang, L., W. Cai, Z. Liu, et al. Fineval: A chinese financial domain knowledge evaluation benchmark for large language models. arXiv preprint arXiv:2308.09975, 2023.
[30] Xie, Q., W. Han, Z. Chen, et al. Finben: A holistic financial benchmark for large language models. In NeurIPS, pages 95716–95743. 2024.
[31] Lei, Y., J. Li, D. Cheng, et al. Cfbenchmark: Chinese financial assistant benchmark for large language model. arXiv preprint arXiv:2311.05812, 2023.
[32] Xu, L., L. Zhu, Y. Wu, et al. Superclue-fin: Graded fine-grained analysis of chinese llms on diverse financial tasks and applications. arXiv preprint arXiv:2404.19063, 2024.
[33] OpenCompass Project. Openfindata: The open-source financial evaluation dataset for large language models, 2023. Available at https://github.com/open-compass/OpenFinData
[34] Li, J., Y. Zhu, D. Cheng, et al. Cfbenchmark-mm: Chinese financial assistant benchmark for multimodal large language model. arXiv preprint arXiv:2506.13055, 2025.
[35] Gan, Z., D. Zhang, H. Li, et al. Mme-finance: A multimodal finance benchmark for expert-level understanding and reasoning. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12867–12874. 2025.
[36] Luo, J., Z. Kou, L. Yang, et al. Finmme: Benchmark dataset for financial multi-modal reasoning evaluation. arXiv preprint arXiv:2505.24714, 2025.
[37] Deng, S., H. Peng, J. Xu, et al. Finmr: A knowledge-intensive multimodal benchmark for advanced financial reasoning. In Proceedings of the 6th ACM International Conference on AI in Finance, pages 168–176. 2025.
[38] Zhu, Y., Y. Jiang, Z. Xu, et al. From comprehension to reasoning: A hierarchical benchmark for automated financial research reporting. arXiv preprint arXiv:2603.19254, 2026.
[39] Hu, Y., Y. Li, P. Liu, et al. Fintsb: A comprehensive and practical benchmark for financial time series forecasting. arXiv preprint arXiv:2502.18834, 2025.
[40] Tang, Z., E. Haihong, R. Li, et al. Finmmdocr: Benchmarking financial multimodal reasoning with scenario awareness, document understanding, and multi-step computation. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, pages 25858–25866. 2026.
[41] Team, G.-. Gpt-5 is here, 2025.
[42] Yang, A., A. Li, B. Yang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[43] Team, G. Grok 4.1 fast and agent tools api, 2025.
[44] Papineni, K., S. Roukos, T. Ward, et al. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318. 2002.
[45] Li, D., B. Jiang, L. Huang, et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791. 2025.
[46] Anderson, L. W., D. R. Krathwohl. A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives: complete edition. Addison Wesley Longman, Inc., 2001.
Sec. A elaborates on the construction of FINDOCMRE, documenting prompts and settings. Sec. B outlines the evaluation protocol, including configuration and scor...
[47] We apply geometric constraints to eliminate layout artifacts, discarding images with low resolutions or extreme aspect ratios.
[48] We integrate image similarity computation with OCR to remove repetitive non-data elements. This hybrid approach enables the deduplication of visual content and the removal of uninformative icons or corporate logos.
[49] We perform page-level textual indexing verification. By scanning the text of corresponding pages, we retain only images explicitly referenced by narrative markers (e.g., “Figure x”), ensuring that all “Cleaned Figures” are linked to the document context. A.2 Visual-Centric Generation To reduce textual bias and ground reasoning in visual evidence, we use a...
[50] Form Serves Function: Strictly adhere to question formats: single_choice, multiple_choice, numerical_precise (Calculations must use explicit values found directly on charts/tables without ambiguity), numerical_approximate (Reasoning requires visual estimation, e.g., reading axis height where precise labels are absent), and open_ended (Pure text answer; strictl...
[51] Evaluate Reasoning, Not Memorization: The primary objective is evaluating deep analytical, reasoning, and synthesis skills
[52] Blind Stem Principle: The stem is strictly forbidden from mentioning chart_id so users emulate real-world blind queries
[53] Promote Comprehensive Analysis: Encourage the design of complex questions that require integrating partial information from multiple distinct charts
[54] Information Silo Principle: All charts must be treated as originating from a fictional, non-public context; do not use external knowledge
[55] Abstract Time Principle: Avoid real dates. Use relative years (e.g., 'Year 1', 'Year 2'). If this conflicts with 'Event Anchoring', the latter takes precedence
[56] Quantitative Anchor Principle: Answers must be uniquely determined by specific information in the charts, avoiding ambiguous estimation scenarios
[57] Event Anchoring Principle: Prioritize using specific events (e.g., "when revenue peaked") to lock time points across multiple charts
[58] Context-Free Stem Principle: The stem must be clear and unambiguous, ensuring solvability whether the input is the full PDF or filtered images. # Classification Tags When generating each question object, you must also add the following two classification tags. The definitions are strict: 1.5 reasoning_type (Reasoning Type): [Must choose one] • Quantitative C...
... Year 1"). Understand connections between charts. Step 2: Mine Scenarios: Prioritize ...
[59] Output ONLY the required JSON content, with no other explanatory text or information
[60] The format for each question’s answer is a dictionary with the key "answer" and the value as a string. Example for single_choice: {"answer": "C"}. Example for multiple_choice: {"answer": "ABD"}. Example for numerical_precise/approximate: {"answer": "12.3"} (No units. Round or format decimals as the question requires). Example for open_ended: {"answer":...
[61] The outermost structure must be a list of these dictionary results, in the same order as the questions. A strict reference for the output format is as follows (do not reference the content, only the format): [{"answer": "C"}, {"answer": "ABD"}, {"answer": "12.3"}, {"answer": "This is a text answer"}] Here is the set of financial questions to b...
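The answer format quoted in entries [60] and [61] (a list of dictionaries, each holding a single "answer" key with a string value, in question order) can be checked mechanically. The following is a minimal sketch of such a check, not the authors' evaluation code. The function name `parse_answers`, the restriction of choice answers to letters A-Z, and the use of `float()` to reject units are all assumptions made for illustration.

```python
import json
import re


def parse_answers(raw, question_types):
    """Parse a model response and check it against the quoted answer format."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, list) or len(data) != len(question_types):
        raise ValueError("expected a JSON list with one entry per question")
    answers = []
    for item, qtype in zip(data, question_types):
        if not isinstance(item, dict) or set(item) != {"answer"}:
            raise ValueError("each entry must be a dict with only the key 'answer'")
        value = item["answer"]
        if not isinstance(value, str):
            raise ValueError("answer values must be strings")
        # Assumption: options are labelled with single capital letters.
        if qtype == "single_choice" and not re.fullmatch(r"[A-Z]", value):
            raise ValueError(f"bad single_choice answer: {value!r}")
        if qtype == "multiple_choice" and not re.fullmatch(r"[A-Z]+", value):
            raise ValueError(f"bad multiple_choice answer: {value!r}")
        if qtype.startswith("numerical_"):
            float(value)  # raises ValueError when units or other text are present
        answers.append(value)
    return answers


# The format-only reference quoted in entry [61].
raw_response = (
    '[{"answer": "C"}, {"answer": "ABD"}, '
    '{"answer": "12.3"}, {"answer": "This is a text answer"}]'
)
question_types = ["single_choice", "multiple_choice", "numerical_precise", "open_ended"]
parsed = parse_answers(raw_response, question_types)
```

Keeping every value a string, as the quoted prompt requires, defers numeric comparison to the scorer, which can then apply whatever rounding or tolerance rule the benchmark defines for each question type.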

[1] [1]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., E. Bieber, M. Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Bai, S., K. Chen, X. Liu, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Chen, Z., W. Chen, C. Smiley, et al. Finqa: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711. 2021

work page 2021

[4] [4]

Ding, J., C. Ding, W. Pang, et al. Cnfinbench: A benchmark for safety and compliance of large language models in finance.arXiv preprint arXiv:2512.09506, 2025

work page arXiv 2025

[5] [5]

Xue, S., X. Li, F. Zhou, et al. Famma: A benchmark for financial domain multilingual multimodal question answering.arXiv preprint arXiv:2410.04526, 2024

work page arXiv 2024

[6] [6]

Karamcheti, S., S. Nair, A. Balakrishna, et al. Prismatic vlms: Investigating the design space of visually- conditioned language models. InForty-first International Conference on Machine Learning. 2024

work page 2024

[7] [7]

Huang, W., H. Liu, M. Guo, et al. Visual hallucinations of multi-modal large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9614–9631. 2024

work page 2024

[8] [8]

GPT-4 Technical Report

Achiam, J., S. Adler, S. Agarwal, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

The claude 3 model family: Opus, sonnet, haiku, 2024

Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. Anthropic Model Card

work page 2024

[10] [10]

Zeng, A., X. Lv, Q. Zheng, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Guo, D., F. Wu, F. Zhu, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Liu, H., C. Li, Q. Wu, et al. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023

[13] [13]

Zhu, J., W. Wang, Z. Chen, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Fu, C., P. Chen, Y . Shen, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Wang, K., J. Pan, W. Shi, et al. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

work page 2024

[16]

Zhang, R., D. Jiang, Y. Zhang, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024

[17]

Yue, X., Y. Ni, K. Zhang, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567. 2024

[18]

Lu, P., H. Bansal, T. Xia, et al. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

[19]

Masry, A., X. L. Do, J. Q. Tan, et al. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279. 2022

[20]

Chen, L., J. Li, X. Dong, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

[21]

Liu, Y., H. Duan, Y. Zhang, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024

[22]

Xu, Z., S. Du, Y. Qi, et al. Chartbench: A benchmark for complex visual reasoning in charts. arXiv preprint arXiv:2312.15915, 2023

[23]

Liu, F., X. Wang, W. Yao, et al. Mmc: Advancing multimodal chart understanding with large-scale instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1287–1310. 2024

[24]

Mathew, M., D. Karatzas, C. Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209. 2021

[25]

Shah, R. S., K. Chawla, D. Eidnani, et al. When flue meets flang: Benchmarks and large pre-trained language model for financial domain. In EMNLP. 2022

[26]

Lu, D., H. Wu, J. Liang, et al. Bbt-fin: Comprehensive construction of chinese financial domain pre-trained language model, corpus and benchmark. arXiv preprint arXiv:2302.09432, 2023

[27]

Li, H., Y. Cao, Y. Yu, et al. Investorbench: A benchmark for financial decision-making tasks with llm-based agent. arXiv preprint arXiv:2412.18174, 2024

[28]

Islam, P., A. Kannappan, D. Kiela, et al. Financebench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023

[29]

Zhang, L., W. Cai, Z. Liu, et al. Fineval: A chinese financial domain knowledge evaluation benchmark for large language models. arXiv preprint arXiv:2308.09975, 2023

[30]

Xie, Q., W. Han, Z. Chen, et al. Finben: A holistic financial benchmark for large language models. In NeurIPS, pages 95716–95743. 2024

[31]

Lei, Y., J. Li, D. Cheng, et al. Cfbenchmark: Chinese financial assistant benchmark for large language model. arXiv preprint arXiv:2311.05812, 2023

[32]

Xu, L., L. Zhu, Y. Wu, et al. Superclue-fin: Graded fine-grained analysis of chinese llms on diverse financial tasks and applications. arXiv preprint arXiv:2404.19063, 2024

[33]

OpenCompass Project. Openfindata: The open-source financial evaluation dataset for large language models, 2023. Available at https://github.com/open-compass/OpenFinData

[34]

Li, J., Y. Zhu, D. Cheng, et al. Cfbenchmark-mm: Chinese financial assistant benchmark for multimodal large language model. arXiv preprint arXiv:2506.13055, 2025

[35]

Gan, Z., D. Zhang, H. Li, et al. Mme-finance: A multimodal finance benchmark for expert-level understanding and reasoning. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12867–12874. 2025

[36]

Luo, J., Z. Kou, L. Yang, et al. Finmme: Benchmark dataset for financial multi-modal reasoning evaluation. arXiv preprint arXiv:2505.24714, 2025

[37]

Deng, S., H. Peng, J. Xu, et al. Finmr: A knowledge-intensive multimodal benchmark for advanced financial reasoning. In Proceedings of the 6th ACM International Conference on AI in Finance, pages 168–176. 2025

[38]

Zhu, Y., Y. Jiang, Z. Xu, et al. From comprehension to reasoning: A hierarchical benchmark for automated financial research reporting. arXiv preprint arXiv:2603.19254, 2026

[39]

Hu, Y., Y. Li, P. Liu, et al. Fintsb: A comprehensive and practical benchmark for financial time series forecasting. arXiv preprint arXiv:2502.18834, 2025

[40]

Tang, Z., E. Haihong, R. Li, et al. Finmmdocr: Benchmarking financial multimodal reasoning with scenario awareness, document understanding, and multi-step computation. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, pages 25858–25866. 2026

[41]

Team, G.-. Gpt-5 is here, 2025

[42]

Yang, A., A. Li, B. Yang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[43]

Team, G. Grok 4.1 fast and agent tools api, 2025

[44]

Papineni, K., S. Roukos, T. Ward, et al. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318. 2002

[45]

Li, D., B. Jiang, L. Huang, et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791. 2025

[46]

Anderson, L. W., D. R. Krathwohl. A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives: complete edition. Addison Wesley Longman, Inc., 2001

Sec. A elaborates on the construction of FINDOCMRE, documenting prompts and settings. Sec. B outlines the evaluation protocol, including configuration and scor...

We apply geometric constraints to eliminate layout artifacts, discarding images with low resolutions or extreme aspect ratios.
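The geometric filter described above can be sketched roughly as follows. This is a minimal illustration only: the `ImageSize` type, the function name, and both threshold values are hypothetical, since the text does not report the actual cut-offs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageSize:
    width: int   # pixels
    height: int  # pixels


# Hypothetical thresholds; the actual values are not stated in the text.
MIN_SIDE_PX = 100
MAX_ASPECT_RATIO = 5.0


def passes_geometric_constraints(
    size: ImageSize,
    min_side: int = MIN_SIDE_PX,
    max_aspect: float = MAX_ASPECT_RATIO,
) -> bool:
    """Return True if the image is large enough and not extremely elongated."""
    if size.width <= 0 or size.height <= 0:
        return False  # degenerate image
    shorter = min(size.width, size.height)
    longer = max(size.width, size.height)
    if shorter < min_side:
        return False  # low resolution
    return longer / shorter <= max_aspect  # reject extreme aspect ratios
```

Under these assumed thresholds, an 800x600 chart would be kept, while a 2000x120 banner strip or a 50x600 sliver would be discarded.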

We integrate image similarity computation with OCR to remove repetitive non-data elements. This hybrid approach enables the deduplication of visual content and the removal of uninformative icons or corporate logos.
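One possible reading of this hybrid step is sketched below. It assumes each image already comes with a precomputed perceptual hash and its OCR text; the `Figure` type, the Hamming-distance threshold, and the "too little text means icon or logo" heuristic are all assumptions made for illustration, not details given in the text.

```python
from typing import List, NamedTuple


class Figure(NamedTuple):
    figure_id: str
    phash: int     # precomputed perceptual hash (hypothetical input)
    ocr_text: str  # text recognized inside the image (hypothetical input)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so OCR noise matters less."""
    return " ".join(text.split()).lower()


def deduplicate(
    figures: List[Figure],
    max_distance: int = 4,   # assumed similarity threshold
    min_ocr_chars: int = 3,  # assumed cut-off for uninformative images
) -> List[Figure]:
    kept: List[Figure] = []
    for fig in figures:
        text = normalize(fig.ocr_text)
        if len(text) < min_ocr_chars:
            continue  # almost no text: treated here as an icon or logo
        is_duplicate = any(
            hamming(fig.phash, k.phash) <= max_distance
            and normalize(k.ocr_text) == text
            for k in kept
        )
        if not is_duplicate:
            kept.append(fig)  # first occurrence of this visual content
    return kept
```

Requiring both a small hash distance and matching OCR text is a deliberately conservative choice in this sketch: two charts with the same layout but different labels are both retained.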

We perform page-level textual indexing verification. By scanning the text of corresponding pages, we retain only images explicitly referenced by narrative markers (e.g., “Figure x”), ensuring that all “Cleaned Figures” are linked to the document context.

A.2 Visual-Centric Generation

To reduce textual bias and ground reasoning in visual evidence, we use a...

Form Serves Function: Strictly adhere to question formats: single_choice, multiple_choice, numerical_precise (Calculations must use explicit values found directly on charts/tables without ambiguity), numerical_approximate (Reasoning requires visual estimation, e.g., reading axis height where precise labels are absent), and open_ended (Pure text answer; strictl...

Evaluate Reasoning, Not Memorization: The primary objective is evaluating deep analytical, reasoning, and synthesis skills.

Blind Stem Principle: The stem is strictly forbidden from mentioning chart_id so users emulate real-world blind queries.

Promote Comprehensive Analysis: Encourage the design of complex questions that require integrating partial information from multiple distinct charts.

Information Silo Principle: All charts must be treated as originating from a fictional, non-public context; do not use external knowledge.

Abstract Time Principle: Avoid real dates. Use relative years (e.g., ‘Year 1’, ‘Year 2’). If this conflicts with ‘Event Anchoring’, the latter takes precedence.

Quantitative Anchor Principle: Answers must be uniquely determined by specific information in the charts, avoiding ambiguous estimation scenarios.

Event Anchoring Principle: Prioritize using specific events (e.g., "when revenue peaked") to lock time points across multiple charts.

Context-Free Stem Principle: The stem must be clear and unambiguous, ensuring solvability whether the input is the full PDF or filtered images.

# Classification Tags

When generating each question object, you must also add the following two classification tags. The definitions are strict:

1.5 reasoning_type (Reasoning Type): [Must choose one]
• Quantitative C...

Output ONLY the required JSON content, with no other explanatory text or information.

The format for each question’s answer is a dictionary with the key "answer" and the value as a string.
◦ Example for single_choice: {"answer": "C"}
◦ Example for multiple_choice: {"answer": "ABD"}
◦ Example for numerical_precise/approximate: {"answer": "12.3"} (No units. Round or format decimals as the question requires).
◦ Example for open_ended: {"answer":...

The outermost structure must be a list of these dictionary results, in the same order as the questions. A strict reference for the output format is as follows (do not reference the content, only the format):

[
  {"answer": "C"},
  {"answer": "ABD"},
  {"answer": "12.3"},
  {"answer": "This is a text answer"}
]

Here is the set of financial questions to b...
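A response in the answer format described above (a list of dictionaries, each with a single string-valued "answer" key, in question order) could be checked with a small validator like the one below. This is a sketch rather than the benchmark's own evaluation code; the function name `parse_answers` and its error messages are hypothetical.

```python
import json
from typing import List


def parse_answers(raw: str, num_questions: int) -> List[str]:
    """Parse a model response and return the answer strings in question order.

    Raises ValueError if the response does not follow the required format.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(data, list):
        raise ValueError("outermost structure must be a list")
    if len(data) != num_questions:
        raise ValueError(
            f"expected {num_questions} answers, got {len(data)}"
        )

    answers: List[str] = []
    for index, item in enumerate(data):
        if not isinstance(item, dict) or set(item) != {"answer"}:
            raise ValueError(f"item {index} must be a dict with only 'answer'")
        if not isinstance(item["answer"], str):
            raise ValueError(f"item {index}: answer must be a string")
        answers.append(item["answer"])
    return answers
```

Numerical answers stay as strings at this stage (for example "12.3"), matching the stated format; converting them to numbers would be a separate scoring step.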