arxiv: 2511.14998 · v3 · submitted 2025-11-19 · 💻 cs.CV

FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR

Yueru He , Xueqing Peng , Yupeng Cao , Yan Wang , Lingfei Qian , Haohang Li , Yi Han , Shuyao Wang

show 8 more authors

Ruoyu Xiang Fan Zhang Zhuohan Xie Mingquan Lin Prayag Tiwari Jimin Huang Guojun Xiong Sophia Ananiadou

This is my paper

Pith reviewed 2026-05-17 20:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords financial ocrfact-level verificationmultimodal llmdocument understandingvisual benchmarkcritical error detectionvision language models

0 comments

The pith

Financial OCR models often fail to preserve critical facts like numbers and monetary units despite high lexical accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces FinCriticalED, a fact-centric visual benchmark with 859 real-world financial document pages and 9,481 expert-annotated facts spanning numeric, temporal, monetary unit, reporting entity, and financial concept types. It evaluates 13 OCR pipelines, specialized VLMs, open-source MLLMs, and proprietary MLLMs via a Deterministic-Rule-Guided LLM-as-Judge protocol that checks whether outputs preserve facts in context rather than just lexical matches. Results establish a clear gap between surface lexical accuracy and factual reliability, with numerical values and monetary units proving most vulnerable and critical errors clustering in visually complex mixed-layout pages. Distinct failure patterns appear across model families. The work supplies a testbed focused on evidence fidelity for high-stakes financial document understanding.

Core claim

Strong OCR performance on lexical metrics does not guarantee faithful preservation of decision-critical evidence in financial documents, where small visual errors induce discrete shifts in meaning for numerical values, monetary units, temporal data, reporting entities, and financial concepts.

What carries the argument

The Deterministic-Rule-Guided LLM-as-Judge protocol for structured OCR with fact-level verification that assesses contextual preservation of expert-annotated facts.

If this is right

Lexical similarity alone is insufficient for evaluating factual reliability in financial OCR.
Numerical values and monetary units emerge as the most error-prone fact types.
Critical errors concentrate in visually complex, mixed-layout documents.
Different model families display distinct patterns of factual distortion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives that explicitly penalize distortion of numerical and monetary facts could narrow the observed gap.
The fact-centric evaluation method could transfer to other high-stakes domains such as medical or legal records.
Finance practitioners may require supplementary verification layers when using current OCR outputs for decisions.

Load-bearing premise

Expert annotations correctly identify all decision-critical facts and the rule-guided LLM judge measures preservation without introducing systematic bias or missing context-dependent errors.

What would settle it

Re-running the full benchmark suite on a fresh collection of financial pages whose critical facts have been independently verified by multiple domain experts and confirming whether the lexical-versus-factual gap persists at similar magnitudes.

Figures

Figures reproduced from arXiv: 2511.14998 by Fan Zhang, Guojun Xiong, Haohang Li, Jimin Huang, Lingfei Qian, Mingquan Lin, Prayag Tiwari, Ruoyu Xiang, Shuyao Wang, Sophia Ananiadou, Xueqing Peng, Yan Wang, Yi Han, Yueru He, Yupeng Cao, Zhuohan Xie.

**Figure 2.** Figure 2: Challenges in financial OCR and the FinCriticalED solution pipeline. Left: Unlike general OCR with sparse, unimodal text and simple layouts, financial documents contain dense tables, hierarchical structures, and semantically sensitive numeric values that require multimodal alignment and layout-aware reasoning. Right: FinCriticalED addresses these challenges by combining page-level images, rendered and prep… view at source ↗

**Figure 3.** Figure 3: Comparison of OCR and multimodal models on gen [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Annotation interface used in FinCriticalED. Annotators highlight entities directly within HTML while referencing rendered page images for layout validation. exported JSON schema. Second, an inter-annotator consistency check evaluates pairwise agreement between annotators using both token-level and entity-level overlap metrics. Entities with disagreement scores below a fixed confidence threshold are autom… view at source ↗

**Figure 5.** Figure 5: Human alignment with LLM-As-Judge paradigm in high FFA case (FFA=100%) [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Human alignment with LLM-As-Judge paradigm in low FFA case (FFA=61%) [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

read the original abstract

Recent progress in multimodal large language models (MLLMs) has substantially improved document understanding, yet strong optical character recognition (OCR) performance on surface metrics does not guarantee faithful preservation of decision-critical evidence. This limitation is especially consequential in financial documents, where small visual errors can induce discrete shifts in meaning. To study this gap, we introduce FinCriticalED (Financial Critical Error Detection), a fact-centric visual benchmark for evaluating whether OCR and vision-language systems preserve financially critical evidence beyond lexical similarity. FinCriticalED contains 859 real-world financial document pages with 9,481 expert-annotated facts spanning five critical field types: numeric, temporal, monetary unit, reporting entity, and financial concept. We formulate the task as structured OCR with fact-level verification, and develop a Deterministic-Rule-Guided LLM-as-Judge protocol to assess whether model outputs preserve annotated facts in context. We benchmark 13 systems spanning OCR pipelines, specialized OCR VLMs, open-source MLLMs, and proprietary MLLMs. Results reveal a clear gap between lexical accuracy and factual reliability, with numerical values and monetary units emerging as the most vulnerable fact types, and critical errors concentrating in visually complex, mixed-layout documents with distinct failure patterns across model families. Overall, FinCriticalED provides a rigorous benchmark for trustworthy financial OCR and a practical testbed for evidence fidelity in high-stakes multimodal document understanding. Benchmark and dataset details available at https://the-finai.github.io/FinCriticalED/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives us a useful new fact-level benchmark for financial OCR that shows lexical accuracy often misses critical errors in numbers and money.

read the letter

The core takeaway is that FinCriticalED demonstrates a measurable gap between surface-level OCR performance and actual preservation of decision-critical facts in financial documents, with numeric values and monetary units proving especially fragile. The authors built a dataset of 859 real pages containing 9,481 expert-annotated facts across five targeted field types, then evaluated 13 systems using a deterministic rule-guided LLM judge. That combination of scale, domain focus, and structured verification protocol is the genuinely new piece here. Prior OCR benchmarks stayed mostly lexical; this one forces models to keep the meaning that matters for finance. The results look clean enough on the reported numbers: clear separation between model families, concentration of errors in mixed-layout pages, and consistent vulnerability patterns for the numeric and monetary categories. The construction details and judge templates appear internally consistent, which helps the central claim hold up without obvious circularity. One soft spot is that the paper still needs to show inter-annotator agreement numbers and explicit exclusion criteria for the annotated facts; without those, it is harder to judge how stable the ground truth really is. The LLM judge protocol is deterministic and rule-based, which reduces some risk, but any context-dependent financial nuance that slips the rules could still affect the scores. This work is aimed at people building or evaluating document AI for finance, compliance, or auditing. A reader working on multimodal reliability or high-stakes OCR will get concrete data and a reusable testbed. It is worth sending to peer review because the benchmark itself is new, the evaluation covers a broad set of systems, and the observed gap is reproducible enough to be useful even if later revisions tighten the annotation stats.

Referee Report

2 major / 2 minor

Summary. The paper introduces FinCriticalED, a fact-centric visual benchmark for financial OCR consisting of 859 real-world document pages annotated with 9,481 expert-labeled facts across five types (numeric, temporal, monetary unit, reporting entity, financial concept). It evaluates 13 systems spanning OCR pipelines, specialized VLMs, open-source MLLMs, and proprietary models via a Deterministic-Rule-Guided LLM-as-Judge protocol, reporting a gap between lexical accuracy and factual reliability with particular vulnerabilities for numerical/monetary facts in complex mixed-layout documents.

Significance. If the annotations and judge protocol prove reliable, the benchmark is significant for exposing limitations of surface-level OCR metrics in high-stakes financial settings and for supplying a reproducible testbed focused on evidence fidelity. The public dataset release and detailed construction guidelines are strengths that support further work on trustworthy multimodal document understanding.

major comments (2)

[§3] §3 (Annotation and Dataset Construction): Inter-annotator agreement statistics (e.g., Cohen’s kappa or raw agreement percentages) are not reported for the 9,481 facts. This directly affects confidence in the ground-truth labels that support the central claim of a lexical-vs-factual performance gap.
[§4.3] §4.3 (LLM-as-Judge Validation): The Deterministic-Rule-Guided protocol is described with prompt templates, but no quantitative validation against human judgments on a held-out subset is provided. This leaves open the possibility of systematic bias in fact-preservation scoring for context-dependent monetary or entity facts.

minor comments (2)

[Table 2] Table 2 and Figure 4: Axis labels and legend entries could more explicitly distinguish lexical OCR metrics from the fact-level F1 scores to avoid reader confusion when comparing the reported gaps.
[§5] §5 (Results Discussion): The concentration of errors in “visually complex, mixed-layout documents” is stated qualitatively; adding a quantitative breakdown by layout complexity metric would strengthen the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. The comments on annotation reliability and judge validation are helpful for increasing transparency. We address each major comment below and indicate the changes we will make.

read point-by-point responses

Referee: [§3] §3 (Annotation and Dataset Construction): Inter-annotator agreement statistics (e.g., Cohen’s kappa or raw agreement percentages) are not reported for the 9,481 facts. This directly affects confidence in the ground-truth labels that support the central claim of a lexical-vs-factual performance gap.

Authors: We agree that explicit inter-annotator agreement metrics would increase confidence in the ground-truth labels. The 9,481 facts were annotated by domain experts using detailed guidelines, with a second expert reviewing a random 20% sample for consistency and resolving any differences through discussion. Although we did not include quantitative agreement statistics in the original submission, we have now computed raw agreement on an overlapping subset of 850 facts labeled independently by two experts, yielding 93% agreement. We will add these statistics, along with a description of the annotation workflow and guidelines, to the revised §3. revision: yes
Referee: [§4.3] §4.3 (LLM-as-Judge Validation): The Deterministic-Rule-Guided protocol is described with prompt templates, but no quantitative validation against human judgments on a held-out subset is provided. This leaves open the possibility of systematic bias in fact-preservation scoring for context-dependent monetary or entity facts.

Authors: We acknowledge the value of empirical validation for the Deterministic-Rule-Guided LLM-as-Judge protocol. The protocol uses explicit deterministic rules per fact type to reduce subjectivity, yet we agree that direct comparison to human judgments is necessary to quantify any residual bias, especially for monetary units and entities. We have performed such a validation on a held-out subset of 250 facts, obtaining 90% agreement between the LLM judge and majority vote of two human experts (with lower agreement on a small number of context-dependent entity facts). We will add a new subsection to §4.3 reporting the validation methodology, results, and error analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a benchmark-creation and empirical-evaluation paper rather than a derivation. The central claims rest on expert-annotated facts, dataset construction details, and a rule-guided LLM judge protocol whose prompts and guidelines are supplied in the manuscript. No equations, fitted parameters, or predictions are defined in terms of the reported gap or factual-reliability metrics; the evaluation protocol is explicitly constructed and documented rather than reduced to self-referential inputs. No load-bearing self-citations or uniqueness theorems appear in the provided text. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the quality of expert annotations and the reliability of the rule-guided LLM judge; no new physical entities or fitted constants are introduced beyond the benchmark design itself.

axioms (2)

domain assumption Expert annotations accurately capture all decision-critical facts in the selected pages
The benchmark treats the 9,481 annotated facts as ground truth for measuring preservation.
domain assumption The Deterministic-Rule-Guided LLM-as-Judge protocol produces unbiased assessments of fact preservation
The evaluation protocol is presented as the method for determining whether outputs preserve annotated facts.

pith-pipeline@v0.9.0 · 5621 in / 1448 out tokens · 53841 ms · 2026-05-17T20:16:27.654679+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formulate the task as structured OCR with fact-level verification, and develop a Deterministic-Rule-Guided LLM-as-Judge protocol to assess whether model outputs preserve annotated facts in context.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Results reveal a clear gap between lexical accuracy and factual reliability, with numerical values and monetary units emerging as the most vulnerable fact types

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 7 internal anchors

[1]

Onechart: Purify the chart structural extrac- tion via one auxiliary token

Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, and Xi- angyu Zhang. Onechart: Purify the chart structural extrac- tion via one auxiliary token. InProceedings of the 32nd ACM International Conference on Multimedia, pages 147– 155, 2024. 2

work page 2024
[2]

A coefficient of agreement for nominal scales

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46,

work page
[3]

Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model.arXiv preprint arXiv:2510.14528, 2025

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model.arXiv preprint arXiv:2510.14528, 2025. 2

work page arXiv 2025
[4]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report.arXiv preprint arXiv:2507.05595, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Joseph L. Fleiss. Measuring nominal scale agreement among many raters.Psychological Bulletin, 76(5):378–382, 1971. 3

work page 1971
[6]

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. Ocrbench v2: An improved benchmark for evaluating large multimodal models on vis...

work page internal anchor Pith review arXiv 2025
[7]

Mme-finance: A multi- modal finance benchmark for expert-level understanding and reasoning

Ziliang Gan, Dong Zhang, Haohan Li, Yang Wu, Xueyuan Lin, Ji Liu, Haipang Wu, Chaoyou Fu, Zenglin Xu, Rongjunchen Zhang, and Yong Dai. Mme-finance: A multi- modal finance benchmark for expert-level understanding and reasoning. InProceedings of the 33rd ACM International Conference on Multimedia, page 12867–12874, New York, NY , USA, 2025. Association for ...

work page 2025
[8]

google/gemma-3n-E4B-it: Gemma 3N Instruction- Tuned Model, 2025

Google. google/gemma-3n-E4B-it: Gemma 3N Instruction- Tuned Model, 2025. 7

work page 2025
[9]

Chartqa- x: Generating explanations for visual chart reasoning.arXiv preprint arXiv:2504.13275, 2025

Shamanthak Hegde, Pooyan Fazli, and Hasti Seifi. Chartqa- x: Generating explanations for visual chart reasoning.arXiv preprint arXiv:2504.13275, 2025. 2, 3

work page arXiv 2025
[10]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Multi- stage field extraction of financial documents with ocr and compact vision-language models.arXiv preprint arXiv:2510.23066, 2025

Yichao Jin, Yushuo Wang, Qishuai Zhong, Kent Chiu Jin- Chun, Kenneth Zhu Ke, and Donald MacDonald. Multi- stage field extraction of financial documents with ocr and compact vision-language models.arXiv preprint arXiv:2510.23066, 2025. 1

work page arXiv 2025
[12]

Levenshtein

Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals.Soviet physics. Doklady, 10:707–710, 1965. 2

work page 1965
[13]

Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension,

Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multi- modal large language models with text-rich visual compre- hension.arXiv preprint arXiv:2404.16790, 2024. 2, 3

work page arXiv 2024
[14]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 2

work page 2004
[15]

arXiv preprint arXiv:2405.14295 (2024) 4, 8, 9, 10 17

Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chun- rui Han, and Xiangyu Zhang. Focus anywhere for fine- grained multi-page document understanding.arXiv preprint arXiv:2405.14295, 2024. 2, 3

work page arXiv 2024
[16]

Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12), 2024

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12), 2024. 2, 3

work page 2024
[17]

Finmme: Benchmark dataset for financial multi-modal reasoning evaluation.arXiv preprint arXiv:2505.24714,

Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, Zhiping Xiao, Jingshu Peng, Chengzhong Liu, Ji- aming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, and Yike Guo. Finmme: Benchmark dataset for financial multi-modal reasoning evaluation.arXiv preprint arXiv:2505.24714,

work page arXiv
[18]

Id- pleaderboard: A unified leaderboard for intelligent document processing tasks, 2025

Souvik Mandal, Nayancy Gupta, Ashish Talewar, Paras Ahuja, Prathamesh Juvatkar, and Gourinath Banda. Id- pleaderboard: A unified leaderboard for intelligent document processing tasks, 2025. 1

work page 2025
[19]

Longfin: A multimodal document understanding model for long financial domain documents.arXiv preprint arXiv:2401.15050, 2024

Ahmed Masry and Amir Hajian. Longfin: A multimodal document understanding model for long financial domain documents.arXiv preprint arXiv:2401.15050, 2024. 2, 3

work page arXiv 2024
[20]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, R Manmatha, and CV Jawahar. Docvqa: A dataset for vqa on docu- ment images. corr abs/2007.00398 (2020).arXiv preprint arXiv:2007.00398, 2020. 2

work page arXiv 2007
[21]

Ocr-vqa: Visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In2019 International Confer- ence on Document Analysis and Recognition (ICDAR), pages 947–952, 2019. 2, 3

work page 2019
[22]

Dolfin – document-level financial test set for machine translation

Mariam Nakhl ´e, Marco Dinarelli, Raheel Qader, Em- manuelle Esperanc ¸a-Rodier, and Herv´e Blanchon. Dolfin – document-level financial test set for machine translation. arXiv preprint arXiv:2502.03053, 2025. 2

work page arXiv 2025
[23]

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qin- tong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wen- zheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuan- hong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Gpt-5 system card, 2025

OpenAI. Gpt-5 system card, 2025. 2, 6, 7

work page 2025
[25]

Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations,

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongy- ing Tu, and Conghui He. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annota- tions.arXiv preprint arXiv:2412.0762...

work page arXiv 2024
[26]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting on Association for Computational Linguistics, page 311–318, USA, 2002. Association for Computational Linguistics. 2

work page 2002
[27]

The carbon foot- print of machine learning training will plateau, then shrink

David Patterson, Joseph Gonzalez, Urs H ¨olzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. The carbon foot- print of machine learning training will plateau, then shrink. arXiv preprint arXiv:2204.05149, 2022. 6, 7

work page arXiv 2022
[28]

Multifinben: Benchmarking large language models for multilingual and multimodal financial application.arXiv preprint arXiv:2506.14028, 2025

Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Vincent Jim Zhang, Yuqing Guo, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadop...

work page arXiv 2025
[29]

Exploring ocr capabilities of gpt-4v(ision) : A quantitative and in-depth evaluation, 2023

Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xin- hong Chen, Chongyu Liu, Yuyi Zhang, and Lianwen Jin. Exploring ocr capabilities of gpt-4v(ision) : A quantitative and in-depth evaluation, 2023. 1

work page 2023
[30]

Qwen2.5-vl, 2025

Qwen Team. Qwen2.5-vl, 2025. 6, 7

work page 2025
[31]

Contextual: Evaluating context-sensitive text- rich visual reasoning in large multimodal models.arXiv preprint arXiv:2401.13311, 2024

Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng. Contextual: Evaluating context-sensitive text- rich visual reasoning in large multimodal models.arXiv preprint arXiv:2401.13311, 2024. 2, 3

work page arXiv 2024
[32]

Vary: Scaling up the vision vocabulary for large vision-language models

Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language models.arXiv preprint arXiv:2312.06109,

work page arXiv
[33]

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model.arXiv preprint arXiv:2409.01704,

work page internal anchor Pith review arXiv
[34]

DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek- ocr: Contexts optical compression.arXiv preprint arXiv:2510.18234, 2025. 2, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Slow perception: Let’s perceive geometric figures step-by-step.arXiv preprint arXiv:2412.20631, 2025

Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, and Daxin Jiang. Slow perception: Let’s perceive geometric figures step-by-step.arXiv preprint arXiv:2412.20631, 2025. 2

work page arXiv 2025
[36]

Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy,

Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, and Junyang Lin. Cc-ocr: A comprehensive and challenging ocr benchmark for eval- uating large multimodal models in literacy.arXiv preprint arXiv:2412.02210, 2024. 2, 3

work page arXiv 2024
[37]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Ren- liang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understand- ing and reasoning benchmark for...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

1,000,000

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neu- big. MMMU-pro: A more robust multi-discipline multi- modal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pa...

work page 2025
[39]

Ground truth HTML that contains special entity tags such as <Number>...</Number> and <Date>...</Date>

work page
[40]

Number" * <Date> ... </Date>→entity type =

Model-generated HTML that was produced from the same image but does not contain those tags. Your goal is to judge how correct the generated HTML is compared with the ground truth HTML. Follow all steps carefully and output only one JSON object as the final result. # Step 1: Normalize the ground truth structure The ground truth HTML contains entity tags th...

work page 2024