
arxiv: 2605.00392 · v3 · pith:G4CSUWHDnew · submitted 2026-05-01 · 💻 cs.CV · cs.LG

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

Pith reviewed 2026-05-22 10:23 UTC · model grok-4.3

keywords token pruning, DeepSeek-OCR, visual tokens, OCR inference, optimal transport, two-stage pruning, efficiency optimization, document understanding

The pith

DeepSeek-OCR can prune visual tokens according to its own two-stage reading pattern to run faster while retaining high accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the decoding process inside DeepSeek-OCR and identifies a consistent two-stage pattern: the model first focuses on the majority of high-norm tokens that carry the main textual and structural cues, then shifts attention to the remaining tokens. This pattern directly inspires RTPrune, a pruning method that keeps the high-norm tokens in the first stage and merges the others in the second stage by pairing them according to optimal transport to aggregate features. A dynamic pruning ratio adjusts how many tokens to drop based on their similarity and the density of the text. If the method works as described, OCR systems could process long documents with substantially fewer visual tokens and less computation, yet still deliver accurate text extraction on benchmarks such as OmniDocBench.

Core claim

DeepSeek-OCR exhibits a distinct two-stage reading trajectory during decoding in which it initially prioritizes the majority of high-norm tokens and subsequently redistributes its attention to the remaining ones. RTPrune follows this trajectory by retaining high-norm visual tokens that hold salient textual and structural information in the first stage, then pairing and merging the leftover tokens through optimal transport theory for efficient feature aggregation in the second stage. The approach adds a dynamic pruning ratio that adapts to token similarity and textual density, producing faster inference while preserving the textual fidelity required for accurate OCR.

What carries the argument

The two-stage reading trajectory that first selects high-norm tokens and then merges the rest via optimal transport theory.
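
The first stage described above can be illustrated with a minimal sketch. It assumes the visual tokens are available as a NumPy array of shape `(n_tokens, dim)` and that "high-norm" means the largest ℓ2 norms; the function name `select_high_norm_tokens` and the `keep_ratio` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def select_high_norm_tokens(tokens: np.ndarray, keep_ratio: float):
    """Split token indices into (kept, leftover) by descending L2 norm.

    Hypothetical helper: `keep_ratio` is an illustrative knob, not a value
    reported by the paper.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    norms = np.linalg.norm(tokens, axis=1)           # one norm per token
    n_keep = max(1, int(round(keep_ratio * len(tokens))))
    order = np.argsort(-norms, kind="stable")        # largest norm first
    keep = np.sort(order[:n_keep])                   # restore spatial order
    rest = np.sort(order[n_keep:])                   # candidates for merging
    return keep, rest


# Toy example: four 2-D tokens with norms 5.0, 0.1, 1.0 and 2.0.
tokens = np.array([[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]])
keep, rest = select_high_norm_tokens(tokens, keep_ratio=0.5)
print(keep.tolist(), rest.tolist())
```

Sorting the kept indices back into ascending order is a design choice made here so that the surviving tokens keep their original reading order; whether RTPrune does the same is not stated on this page.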

If this is right

  • Achieves 99.47 percent accuracy on OmniDocBench while retaining only 84.25 percent of the original tokens.
  • Produces 1.23 times faster prefill inference when used with DeepSeek-OCR-Large.
  • Outperforms prior token-pruning methods for vision-language models by better preserving textual fidelity.
  • Improves the efficiency-accuracy trade-off through a dynamic ratio that responds to textual density.
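
The headline numbers in this list can be related with simple arithmetic. The sketch below assumes that "1.23 times faster" means prefill time is divided by 1.23; the variable and function names are illustrative only.

```python
def time_reduction_from_speedup(speedup: float) -> float:
    """Fraction of time saved when a step runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


token_retention = 0.8425   # fraction of visual tokens kept (from the list above)
speedup = 1.23             # reported prefill speedup (from the list above)

pruned_fraction = 1.0 - token_retention
time_saved = time_reduction_from_speedup(speedup)
print(f"{pruned_fraction:.2%} of tokens removed, {time_saved:.2%} of prefill time saved")
```

Under that reading, removing 15.75% of the tokens saves about 18.7% of prefill time. The Figure 1 caption quotes "nearly 18.90%", which corresponds to a speedup of about 1.233 and rounds to the same 1.23 figure.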

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage attention shift may exist in other large vision-language models, allowing similar pruning without model-specific retraining.
  • Pairing RTPrune with additional compression steps such as quantization could further reduce memory use on long documents.
  • The dynamic ratio based on similarity and density could be tested on streaming OCR tasks where text density changes rapidly.

Load-bearing premise

The two-stage reading trajectory observed in DeepSeek-OCR supplies a reliable signal for deciding which tokens can be pruned or merged without losing the textual information needed for accurate OCR.

What would settle it

Applying RTPrune to a new set of documents and measuring whether the character error rate rises above 5 percent while the reported speed gains are still obtained.
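
The test above depends on a character error rate (CER). A minimal sketch of one common definition, Levenshtein edit distance divided by reference length, is shown here; the helper names and the way the 5 percent threshold is applied to a mean over documents are assumptions made for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def exceeds_threshold(cer_values, threshold: float = 0.05) -> bool:
    """Hypothetical decision rule: mean CER over documents above the threshold."""
    return sum(cer_values) / len(cer_values) > threshold


cer = character_error_rate("token pruning", "token prunning")
print(round(cer, 4), exceeds_threshold([cer]))
```

A single inserted character in a 13-character reference already gives a CER of about 7.7 percent, so short test strings are a poor basis for the 5 percent criterion; page-length documents would be needed.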

Figures

Figures reproduced from arXiv: 2605.00392 by Ben Wan, Jia Wang, Tongxuan Liu, Weizhe Huang, Yan Feng, Yuting Zeng, Zihan Tang.

Figure 1: Performance and efficiency. (left) Our RTPrune consistently outperforms prior token pruning methods on DeepSeek-OCR, retaining over 97.88% of accuracy with 84% of visual tokens on olmOCR-Bench. (right) Our RTPrune reduces GFLOPs by nearly 15.29% and prefill time by nearly 18.90% on OmniDocBench when maintaining 99.47% accuracy.
Figure 2: Comparison of different token pruning methods on DeepSeek-OCR-Base. The patches highlighted in blue are pruned and the text highlighted in red indicates discrepancies with the ground truth. While methods based on original image, attention scores, textual relevance, or inter-token similarity fail to generate text accurately, our approach precisely captures tokens containing critical textual information, lea…
Figure 3: Top-K Intersection Ratio (TIR) between high-norm visual embeddings and LLM high-attention visual tokens during prefill, where the violin plot shows the TIR with high-attention tokens at each individual LLM layer, while the line plot reports the TIR with the union of high-attention tokens from the current and all previous layers, evaluated on OmniDocBench (10% subset) by DeepSeek-OCR-Base.
Figure 4: Overview of RTPrune. Our framework dynamically determines the pruning ratio by evaluating inter-token similarity and image-level textual density. The process then bifurcates into dominant token selection via embedding ℓ2-norms and residual information integration through optimal transport-based merging. As a training-free and model-agnostic approach, RTPrune mimics the dual-pass reading behavior of the LLM…
Figure 5: Performance comparison of various pruning methods with dynamic pruning strategy on ocean-OCR Benchmark using DeepSeek-OCR-Gundam across multiple noteworthy OCR abilities. E, F, P, R, B, and M are the abbreviations for Edit Distance, F1-Score, Precision, Recall, BLEU, and METEOR respectively. For Edit Distance, the plotted score is computed with x_after = 1 − x_before for better visualization.
Figure 6: Visualizations of visual tokens redundancy on DeepSeek-OCR-Base. The patch highlighted in blue is pruned and the patches highlighted in red are kept.
Figure 7: Visualizations of textual relevance in different CLIP model. The color gradient indicates the degree of relevance, where cooler blue tones represent lower correlation and warmer red tones denote higher relevance.
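
The Top-K Intersection Ratio (TIR) named in the captions above is not defined on this page. A minimal sketch under one plausible reading, the overlap between the K highest-norm tokens and the K most-attended tokens divided by K, is given here; that definition and every name in the code are assumptions.

```python
import numpy as np


def top_k_indices(scores: np.ndarray, k: int) -> set:
    """Indices of the k largest scores (ties broken by position)."""
    return set(np.argsort(-scores, kind="stable")[:k].tolist())


def top_k_intersection_ratio(norms: np.ndarray, attention: np.ndarray, k: int) -> float:
    """Assumed TIR: |top-k by norm ∩ top-k by attention| / k."""
    if len(norms) != len(attention):
        raise ValueError("norms and attention must have the same length")
    if not 0 < k <= len(norms):
        raise ValueError("k must be in [1, n_tokens]")
    return len(top_k_indices(norms, k) & top_k_indices(attention, k)) / k


# Toy example with five tokens: per-token embedding norms and the attention
# mass each token receives at one layer.
norms = np.array([5.0, 0.1, 1.0, 2.0, 4.0])
attention = np.array([0.30, 0.05, 0.25, 0.10, 0.30])
tir = top_k_intersection_ratio(norms, attention, k=3)
print(round(tir, 3))
```

The cumulative variant mentioned in the caption (union of high-attention tokens over the current and all previous layers) could be obtained by taking the union of the per-layer top-k sets before intersecting.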
Original abstract

DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet its visual tokens still carry redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23× faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RTPrune, a two-stage token pruning method for DeepSeek-OCR motivated by an observed two-stage reading trajectory in the model's decoding process. In stage one, high-norm visual tokens are prioritized; in stage two, remaining tokens are paired and merged via optimal transport. A dynamic pruning ratio adapts to token similarity and textual density. On OmniDocBench with DeepSeek-OCR-Large, the method reports 99.47% accuracy at 84.25% token retention and 1.23× faster prefill.

Significance. If the two-stage trajectory observation and OT merging are shown to preserve OCR-critical features, the work could offer a practical efficiency gain for visual-text compression in OCR VLMs. The dynamic ratio and transport-based aggregation are distinctive elements that address redundancy in long-text OCR settings.

major comments (3)
  1. [§3.1–3.2] The central claim rests on the two-stage trajectory (initial high-norm prioritization followed by redistribution), yet no attention-norm histograms, per-layer norm statistics, or stage-transition thresholds are provided to quantify the observation or justify the pruning decisions.
  2. [§3.3] The optimal-transport merging step is load-bearing for fidelity at 84.25% retention, but pair-selection criteria, cost function definition, and any verification that merged features retain character-level or structural information are absent; without these the 99.47% accuracy claim cannot be evaluated.
  3. [§4 / Table 2] The reported 99.47% accuracy and 1.23× prefill speedup are presented without baselines, ablations on the dynamic ratio, error bars, or cross-layout robustness tests; this directly undermines the state-of-the-art claim given the abstract's limited experimental detail.
minor comments (2)
  1. [§3.3] Notation for the dynamic pruning ratio and transport cost should be defined explicitly with symbols before first use.
  2. [Figures] Figure captions for any attention or merging visualizations should state the exact document subset and layer indices shown.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and rigor where appropriate.

point-by-point responses
  1. Referee: [§3.1–3.2] The central claim rests on the two-stage trajectory (initial high-norm prioritization followed by redistribution), yet no attention-norm histograms, per-layer norm statistics, or stage-transition thresholds are provided to quantify the observation or justify the pruning decisions.

    Authors: We agree that explicit quantitative support strengthens the motivation. Our analysis of DeepSeek-OCR's decoding process identified the two-stage pattern via attention norm monitoring, but supporting visualizations were omitted from the initial submission. We have added attention-norm histograms, per-layer norm statistics, and stage-transition thresholds to Sections 3.1–3.2 in the revision. revision: yes

  2. Referee: [§3.3] The optimal-transport merging step is load-bearing for fidelity at 84.25% retention, but pair-selection criteria, cost function definition, and any verification that merged features retain character-level or structural information are absent; without these the 99.47% accuracy claim cannot be evaluated.

    Authors: We acknowledge the need for fuller specification of the OT step. Pair selection uses cosine similarity on normalized visual embeddings, and the cost function is the squared Euclidean distance for the transport plan. We have expanded Section 3.3 with the complete formulation and added verification experiments demonstrating preserved character-level accuracy and layout structure. revision: yes

  3. Referee: [§4 / Table 2] The reported 99.47% accuracy and 1.23× prefill speedup are presented without baselines, ablations on the dynamic ratio, error bars, or cross-layout robustness tests; this directly undermines the state-of-the-art claim given the abstract's limited experimental detail.

    Authors: Table 2 already contains comparisons against prior token-pruning baselines. To further address the comment we have added ablations on the dynamic pruning ratio, error bars from repeated runs, and cross-layout robustness tests on varied OmniDocBench documents. These updates appear in the revised Section 4 and Table 2. revision: partial
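
The second response above names a squared Euclidean transport cost. The sketch below shows one way such a merge could look, using entropic optimal transport solved by Sinkhorn iterations in plain NumPy. It is an illustration under stated assumptions, not the authors' formulation: the split of leftover tokens into `sources` and `targets`, the uniform marginals, the cost normalisation, the regulariser `reg`, and the 50/50 averaging rule are all hypothetical. For unit-normalised embeddings the squared Euclidean distance equals 2 − 2·cos, so this cost orders pairs the same way cosine similarity does.

```python
import numpy as np


def sinkhorn_plan(cost: np.ndarray, reg: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)                 # uniform mass on source tokens
    b = np.full(m, 1.0 / m)                 # uniform mass on target tokens
    scale = cost.max() if cost.max() > 0 else 1.0
    kernel = np.exp(-(cost / scale) / reg)  # normalised cost avoids underflow
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def merge_by_transport(sources: np.ndarray, targets: np.ndarray, reg: float = 0.1):
    """Fold `sources` into `targets` using the transport plan (hypothetical rule)."""
    diff = sources[:, None, :] - targets[None, :, :]
    cost = (diff ** 2).sum(axis=-1)         # squared Euclidean cost matrix
    plan = sinkhorn_plan(cost, reg=reg)
    # Barycentric projection: the mass-weighted mean of sources sent to each target.
    transported = (plan.T @ sources) / plan.sum(axis=0)[:, None]
    merged = 0.5 * (targets + transported)  # illustrative 50/50 average
    return merged, plan


sources = np.array([[0.0, 0.0], [1.0, 1.0]])
targets = np.array([[0.0, 0.1], [1.0, 0.9]])
merged, plan = merge_by_transport(sources, targets)
print(np.round(merged, 3))
```

In this toy case each source sends almost all of its mass to the nearby target, so the merged tokens sit midway between each matched pair. A library such as POT provides tested solvers and would be a more robust choice than this hand-written loop.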

Circularity Check

0 steps flagged

No significant circularity; derivation is empirically motivated and self-contained

full rationale

The paper derives RTPrune from an empirical analysis of DeepSeek-OCR's decoding process revealing a two-stage trajectory (high-norm token prioritization followed by attention redistribution to remaining tokens). This observation directly motivates the first-stage high-norm pruning and second-stage optimal-transport merging without any step reducing to self-definition, fitted inputs called predictions, or load-bearing self-citations. The dynamic pruning ratio adapts to similarity and density as an extension of the insight, and accuracy claims are tied to external experiments on OmniDocBench rather than being tautological with the trajectory description. No equations or uniqueness theorems are shown to collapse the method onto its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the observed two-stage trajectory as a pruning signal and on optimal transport as a suitable merging operator; no new entities are postulated.

free parameters (1)
  • dynamic pruning ratio
    Adapts to token similarity and textual density; value is not stated and appears chosen or tuned for the OCR task.
axioms (1)
  • domain assumption Optimal transport theory provides an effective way to pair and merge remaining visual tokens while preserving feature information.
    Invoked directly for the second-stage aggregation step.
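
Because the page notes that the dynamic pruning ratio's value is not stated, the following is a purely hypothetical illustration of how a ratio could respond to token similarity and textual density. The formula, the cap `max_ratio`, and the function names are invented for this sketch and should not be read as the paper's definition.

```python
import numpy as np


def mean_pairwise_cosine(tokens: np.ndarray) -> float:
    """Average cosine similarity over all distinct token pairs."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


def dynamic_prune_ratio(tokens: np.ndarray, text_density: float,
                        max_ratio: float = 0.3) -> float:
    """Hypothetical rule: prune more when tokens are redundant and text is sparse."""
    redundancy = float(np.clip(mean_pairwise_cosine(tokens), 0.0, 1.0))
    sparsity = 1.0 - float(np.clip(text_density, 0.0, 1.0))
    return max_ratio * redundancy * sparsity


redundant = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])  # identical tokens
distinct = np.array([[1.0, 0.0], [0.0, 1.0]])               # orthogonal tokens
print(dynamic_prune_ratio(redundant, text_density=0.2))
print(dynamic_prune_ratio(distinct, text_density=0.2))
```

The only property this sketch is meant to convey is the direction of the dependence: highly similar tokens on a sparse page allow a larger pruning ratio, while dissimilar tokens or dense text push the ratio toward zero.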

pith-pipeline@v0.9.0 · 5774 in / 1318 out tokens · 37719 ms · 2026-05-22T10:23:57.770600+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

    cs.CV 2026-05 unverdicted novelty 5.0

    FastOCR dynamically selects a small subset of visual tokens per decoding step using focal-guided pruning and cross-step reuse, retaining 98% accuracy on Qwen2.5-VL while attending to only 5% of tokens and cutting atte...
