RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

Ben Wan; Jia Wang; Tongxuan Liu; Weizhe Huang; Yan Feng; Yuting Zeng; Zihan Tang

arxiv: 2605.00392 · v3 · pith:G4CSUWHDnew · submitted 2026-05-01 · 💻 cs.CV · cs.LG

Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng, Jia Wang, Tongxuan Liu

Pith reviewed 2026-05-22 10:23 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords token pruning · DeepSeek-OCR · visual tokens · OCR inference · optimal transport · two-stage pruning · efficiency optimization · document understanding

The pith

DeepSeek-OCR can prune visual tokens according to its own two-stage reading pattern to run faster while retaining high accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the decoding process inside DeepSeek-OCR and identifies a consistent two-stage pattern: the model first focuses on the majority of high-norm tokens that carry the main textual and structural cues, then shifts attention to the remaining tokens. This pattern directly inspires RTPrune, a pruning method that keeps the high-norm tokens in the first stage and merges the others in the second stage by pairing them according to optimal transport to aggregate features. A dynamic pruning ratio adjusts how many tokens to drop based on their similarity and the density of the text. If the method works as described, OCR systems could process long documents with substantially fewer visual tokens and less computation, yet still deliver accurate text extraction on benchmarks such as OmniDocBench.

Core claim

DeepSeek-OCR exhibits a distinct two-stage reading trajectory during decoding in which it initially prioritizes the majority of high-norm tokens and subsequently redistributes its attention to the remaining ones. RTPrune follows this trajectory by retaining high-norm visual tokens that hold salient textual and structural information in the first stage, then pairing and merging the leftover tokens through optimal transport theory for efficient feature aggregation in the second stage. The approach adds a dynamic pruning ratio that adapts to token similarity and textual density, producing faster inference while preserving the textual fidelity required for accurate OCR.
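
The first stage described above can be illustrated with a minimal sketch. It assumes the visual tokens arrive as an (n, d) NumPy array and that "high-norm" means the largest ℓ2 norms. The function name `split_by_norm` and the `keep_ratio` parameter are hypothetical, introduced only for illustration, and are not taken from the paper.

```python
import numpy as np

def split_by_norm(tokens: np.ndarray, keep_ratio: float):
    """Split visual tokens into high-norm (kept) and residual index sets.

    tokens: (n, d) array of visual token embeddings.
    keep_ratio: fraction of tokens kept in stage one (hypothetical knob).
    """
    n = tokens.shape[0]
    k = max(1, min(n, int(round(keep_ratio * n))))
    norms = np.linalg.norm(tokens, axis=1)      # per-token L2 norm
    order = np.argsort(-norms, kind="stable")   # indices sorted by descending norm
    keep_idx = np.sort(order[:k])               # restore original spatial order
    rest_idx = np.sort(order[k:])               # candidates for stage-two merging
    return keep_idx, rest_idx

# Toy example: norms are 5.0, 0.1, 1.0 and 2.0.
tokens = np.array([[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]])
keep_idx, rest_idx = split_by_norm(tokens, keep_ratio=0.5)
print(keep_idx.tolist(), rest_idx.tolist())  # [0, 3] [1, 2]
```

Sorting the kept indices preserves the original token order, which seems a sensible default when token position encodes layout; whether RTPrune does the same is not stated on this page.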

What carries the argument

The two-stage reading trajectory that first selects high-norm tokens and then merges the rest via optimal transport theory.

If this is right

Achieves 99.47 percent accuracy on OmniDocBench while retaining only 84.25 percent of the original tokens.
Produces 1.23 times faster prefill inference when used with DeepSeek-OCR-Large.
Outperforms prior token-pruning methods for vision-language models by better preserving textual fidelity.
Improves the efficiency-accuracy trade-off through a dynamic ratio that responds to textual density.
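
The speedup figure can be cross-checked against the Figure 1 caption, which reports prefill time reduced by nearly 18.90% on OmniDocBench. A short sketch of that arithmetic, where `speedup_from_time_reduction` is a helper name made up for this example:

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional time reduction into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)

# Figure 1 caption: prefill time reduced by nearly 18.90%.
prefill_speedup = speedup_from_time_reduction(0.1890)
print(f"{prefill_speedup:.2f}x")  # 1.23x
```

The two numbers are therefore the same measurement stated two ways. Likewise, retaining 84.25% of tokens means dropping 15.75%, which is close to the roughly 15.29% GFLOPs reduction quoted in the same caption.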

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage attention shift may exist in other large vision-language models, allowing similar pruning without model-specific retraining.
Pairing RTPrune with additional compression steps such as quantization could further reduce memory use on long documents.
The dynamic ratio based on similarity and density could be tested on streaming OCR tasks where text density changes rapidly.

Load-bearing premise

The two-stage reading trajectory observed in DeepSeek-OCR supplies a reliable signal for deciding which tokens can be pruned or merged without losing the textual information needed for accurate OCR.

What would settle it

Applying RTPrune to a new set of documents and measuring whether the character error rate rises above 5 percent while the reported speed gains are still obtained.
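
The check above needs a character error rate. A minimal sketch follows, assuming CER is the Levenshtein edit distance divided by the reference length, which is a common convention; the paper's own metric definitions are not given on this page. The sample strings are invented for illustration, and the 5 percent threshold is the one named above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete from ref
                            curr[j - 1] + 1,          # insert into ref
                            prev[j - 1] + (r != h)))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(ref: str, hyp: str) -> float:
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical OCR output with one wrong character out of 13.
cer = character_error_rate("token pruning", "token prunlng")
exceeds_threshold = cer > 0.05
print(round(cer, 4), exceeds_threshold)  # 0.0769 True
```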

Figures

Figures reproduced from arXiv: 2605.00392 by Ben Wan, Jia Wang, Tongxuan Liu, Weizhe Huang, Yan Feng, Yuting Zeng, Zihan Tang.

**Figure 1.** Performance and efficiency. (left) Our RTPrune consistently outperforms prior token pruning methods on DeepSeek-OCR, retaining over 97.88% of accuracy with 84% of visual tokens on olmOCR-Bench. (right) Our RTPrune reduces GFLOPs by nearly 15.29% and prefill time by nearly 18.90% on OmniDocBench while maintaining 99.47% accuracy.

**Figure 2.** Comparison of different token pruning methods on DeepSeek-OCR-Base. The patches highlighted in blue are pruned and the text highlighted in red indicates discrepancies with the ground truth. While methods based on original image, attention scores, textual relevance, or inter-token similarity fail to generate text accurately, our approach precisely captures tokens containing critical textual information, lea…

**Figure 3.** Top-K Intersection Ratio (TIR) between high-norm visual embeddings and LLM high-attention visual tokens during prefill, where the violin plot shows the TIR with high-attention tokens at each individual LLM layer, while the line plot reports the TIR with the union of high-attention tokens from the current and all previous layers, evaluated on OmniDocBench (10% subset) by DeepSeek-OCR-Base.

**Figure 4.** Overview of RTPrune. Our framework dynamically determines the pruning ratio by evaluating inter-token similarity and image-level textual density. The process then bifurcates into dominant token selection via embedding ℓ2-norms and residual information integration through optimal transport-based merging. As a training-free and model-agnostic approach, RTPrune mimics the dual-pass reading behavior of the LLM…

**Figure 5.** Performance comparison of various pruning methods with dynamic pruning strategy on the Ocean-OCR benchmark using DeepSeek-OCR-Gundam across multiple noteworthy OCR abilities. E, F, P, R, B, and M are the abbreviations for Edit Distance, F1-Score, Precision, Recall, BLEU, and METEOR respectively. For Edit Distance, the plotted score is computed with $x_{\text{after}} = 1 - x_{\text{before}}$ for better visualization.

**Figure 6.** Visualizations of visual token redundancy on DeepSeek-OCR-Base. The patch highlighted in blue is pruned and the patches highlighted in red are kept.

**Figure 7.** Visualizations of textual relevance in different CLIP models. The color gradient indicates the degree of relevance, where cooler blue tones represent lower correlation and warmer red tones denote higher relevance.
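
The Top-K Intersection Ratio (TIR) named in the figure captions above is not defined on this page. The sketch below assumes the natural reading: the size of the overlap between the top-k tokens by embedding norm and the top-k tokens by attention, divided by k. The cumulative variant mirrors the caption's "union of high-attention tokens from the current and all previous layers". All function names and the toy numbers are hypothetical.

```python
import numpy as np

def top_k_indices(scores: np.ndarray, k: int) -> set:
    """Indices of the k largest scores."""
    return set(np.argsort(-scores, kind="stable")[:k].tolist())

def tir(norms: np.ndarray, attn: np.ndarray, k: int) -> float:
    """Assumed definition: |top-k by norm ∩ top-k by attention| / k."""
    return len(top_k_indices(norms, k) & top_k_indices(attn, k)) / k

def cumulative_tir(norms: np.ndarray, attn_per_layer, k: int) -> list:
    """TIR against the union of high-attention tokens up to each layer."""
    top_norm = top_k_indices(norms, k)
    seen = set()
    ratios = []
    for attn in attn_per_layer:
        seen |= top_k_indices(attn, k)
        ratios.append(len(top_norm & seen) / k)
    return ratios

norms = np.array([5.0, 1.0, 4.0, 2.0, 3.0])       # top-2 by norm: tokens 0 and 2
layer0 = np.array([0.40, 0.05, 0.10, 0.15, 0.30])  # top-2 by attention: 0 and 4
layer1 = np.array([0.10, 0.05, 0.50, 0.30, 0.05])  # top-2 by attention: 2 and 3
print(tir(norms, layer0, k=2))                       # 0.5
print(cumulative_tir(norms, [layer0, layer1], k=2))  # [0.5, 1.0]
```

In the toy data the per-layer overlap is only partial, but the cumulative overlap reaches 1.0 once the second layer is included, which is the kind of pattern the two-stage reading claim would predict.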

read the original abstract

DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23$\times$ faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RTPrune, a two-stage token pruning method for DeepSeek-OCR motivated by an observed two-stage reading trajectory in the model's decoding process. In stage one, high-norm visual tokens are prioritized; in stage two, remaining tokens are paired and merged via optimal transport. A dynamic pruning ratio adapts to token similarity and textual density. On OmniDocBench with DeepSeek-OCR-Large, the method reports 99.47% accuracy at 84.25% token retention and 1.23× faster prefill.

Significance. If the two-stage trajectory observation and OT merging are shown to preserve OCR-critical features, the work could offer a practical efficiency gain for visual-text compression in OCR VLMs. The dynamic ratio and transport-based aggregation are distinctive elements that address redundancy in long-text OCR settings.

major comments (3)

[§3.1–3.2] The central claim rests on the two-stage trajectory (initial high-norm prioritization followed by redistribution), yet no attention-norm histograms, per-layer norm statistics, or stage-transition thresholds are provided to quantify the observation or justify the pruning decisions.
[§3.3] The optimal-transport merging step is load-bearing for fidelity at 84.25% retention, but pair-selection criteria, cost function definition, and any verification that merged features retain character-level or structural information are absent; without these the 99.47% accuracy claim cannot be evaluated.
[§4 / Table 2] The reported 99.47% accuracy and 1.23× prefill speedup are presented without baselines, ablations on the dynamic ratio, error bars, or cross-layout robustness tests; this directly undermines the state-of-the-art claim given the abstract's limited experimental detail.

minor comments (2)

[§3.3] Notation for the dynamic pruning ratio and transport cost should be defined explicitly with symbols before first use.
[Figures] Figure captions for any attention or merging visualizations should state the exact document subset and layer indices shown.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and rigor where appropriate.

read point-by-point responses

Referee: [§3.1–3.2] The central claim rests on the two-stage trajectory (initial high-norm prioritization followed by redistribution), yet no attention-norm histograms, per-layer norm statistics, or stage-transition thresholds are provided to quantify the observation or justify the pruning decisions.

Authors: We agree that explicit quantitative support strengthens the motivation. Our analysis of DeepSeek-OCR's decoding process identified the two-stage pattern via attention norm monitoring, but supporting visualizations were omitted from the initial submission. We have added attention-norm histograms, per-layer norm statistics, and stage-transition thresholds to Sections 3.1–3.2 in the revision.
Revision: yes

Referee: [§3.3] The optimal-transport merging step is load-bearing for fidelity at 84.25% retention, but pair-selection criteria, cost function definition, and any verification that merged features retain character-level or structural information are absent; without these the 99.47% accuracy claim cannot be evaluated.

Authors: We acknowledge the need for fuller specification of the OT step. Pair selection uses cosine similarity on normalized visual embeddings, and the cost function is the squared Euclidean distance for the transport plan. We have expanded Section 3.3 with the complete formulation and added verification experiments demonstrating preserved character-level accuracy and layout structure.
Revision: yes

Referee: [§4 / Table 2] The reported 99.47% accuracy and 1.23× prefill speedup are presented without baselines, ablations on the dynamic ratio, error bars, or cross-layout robustness tests; this directly undermines the state-of-the-art claim given the abstract's limited experimental detail.

Authors: Table 2 already contains comparisons against prior token-pruning baselines. To further address the comment we have added ablations on the dynamic pruning ratio, error bars from repeated runs, and cross-layout robustness tests on varied OmniDocBench documents. These updates appear in the revised Section 4 and Table 2.
Revision: partial
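
The simulated response on the OT step names cosine similarity on normalized embeddings and a squared Euclidean cost. For unit vectors these agree, since ‖a − b‖² = 2 − 2·cos(a, b). The sketch below is one possible reading and not the authors' formulation: it assumes residual tokens are split alternately into sources and targets, uses entropic (Sinkhorn) optimal transport with uniform marginals, and averages each target with the source mass transported to it. The names `sinkhorn_plan` and `ot_merge`, the value of `eps`, and the alternating split are all assumptions.

```python
import numpy as np

def sinkhorn_plan(cost: np.ndarray, eps: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropic-regularized OT plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass on source tokens
    b = np.full(m, 1.0 / m)   # uniform mass on target tokens
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

def unit_rows(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

def ot_merge(residual: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Merge alternating source tokens into target tokens (assumed scheme)."""
    if residual.shape[0] < 2:
        raise ValueError("need at least two residual tokens")
    src, dst = residual[0::2], residual[1::2]
    cos = unit_rows(src) @ unit_rows(dst).T
    cost = 2.0 - 2.0 * cos                            # squared Euclidean on unit vectors
    plan = sinkhorn_plan(cost, eps)
    weights = plan / plan.sum(axis=0, keepdims=True)  # each column sums to 1
    transported = weights.T @ src                     # source mixture per target
    return 0.5 * (dst + transported)

residual = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
merged = ot_merge(residual)
print(merged.shape)  # (2, 2)
```

In this toy case each source is matched almost entirely to its near-duplicate target, so four residual tokens collapse to two. A production version would more likely use an established OT solver than hand-rolled iterations.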

Circularity Check

0 steps flagged

No significant circularity; derivation is empirically motivated and self-contained

full rationale

The paper derives RTPrune from an empirical analysis of DeepSeek-OCR's decoding process revealing a two-stage trajectory (high-norm token prioritization followed by attention redistribution to remaining tokens). This observation directly motivates the first-stage high-norm pruning and second-stage optimal-transport merging without any step reducing to self-definition, fitted inputs called predictions, or load-bearing self-citations. The dynamic pruning ratio adapts to similarity and density as an extension of the insight, and accuracy claims are tied to external experiments on OmniDocBench rather than being tautological with the trajectory description. No equations or uniqueness theorems are shown to collapse the method onto its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of the observed two-stage trajectory as a pruning signal and on optimal transport as a suitable merging operator; no new entities are postulated.

free parameters (1)

dynamic pruning ratio
Adapts to token similarity and textual density; value is not stated and appears chosen or tuned for the OCR task.
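
Because the ratio's formula is not stated here, the following is a purely hypothetical sketch of how such a ratio could respond to the two named signals. It assumes token similarity is the mean pairwise cosine similarity, textual density is approximated by the fraction of pixels with a strong intensity gradient, and the two are combined by a simple product capped at `max_ratio`. None of these choices, names, or constants come from the paper.

```python
import numpy as np

def mean_cosine_similarity(tokens: np.ndarray) -> float:
    """Mean off-diagonal cosine similarity; requires at least two tokens."""
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T
    n = tokens.shape[0]
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))

def edge_density(image: np.ndarray, threshold: float = 0.1) -> float:
    """Fraction of pixels whose gradient magnitude exceeds threshold (crude density proxy)."""
    gy, gx = np.gradient(image.astype(float))
    return float((np.hypot(gx, gy) > threshold).mean())

def dynamic_prune_ratio(tokens: np.ndarray, image: np.ndarray, max_ratio: float = 0.3) -> float:
    """Hypothetical rule: prune more when tokens are redundant and text is sparse."""
    redundancy = min(max(mean_cosine_similarity(tokens), 0.0), 1.0)
    sparsity = 1.0 - edge_density(image)
    return max_ratio * redundancy * sparsity

blank_page = np.zeros((8, 8))
ratio_redundant = dynamic_prune_ratio(np.ones((4, 3)), blank_page)  # identical tokens
ratio_distinct = dynamic_prune_ratio(np.eye(3), blank_page)         # orthogonal tokens
```

With identical tokens on a blank page the rule prunes at the cap, and with mutually orthogonal tokens it prunes nothing, which is the qualitative behaviour the ledger entry describes.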

axioms (1)

Domain assumption: Optimal transport theory provides an effective way to pair and merge remaining visual tokens while preserving feature information.
Invoked directly for the second-stage aggregation step.

pith-pipeline@v0.9.0 · 5774 in / 1318 out tokens · 37719 ms · 2026-05-22T10:23:57.770600+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
cs.CV · 2026-05 · unverdicted · novelty 5.0

FastOCR dynamically selects a small subset of visual tokens per decoding step using focal-guided pruning and cross-step reuse, retaining 98% accuracy on Qwen2.5-VL while attending to only 5% of tokens and cutting atte...

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.

[2] Chen, S., Guo, X., Li, Y., Zhang, T., Lin, M., Kuang, D., Zhang, Y., Ming, L., Zhang, F., Wang, Y., et al. Ocean-OCR: Towards general OCR application via a vision-language model. arXiv preprint arXiv:2501.15558.

[3] Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., et al. PaddleOCR 3.0 technical report. arXiv preprint arXiv:2507.05595.

[4] Darcet, T., Oquab, M., Mairal, J., and Bojanowski, P. Vision transformers need registers. In International Conference on Learning Representations, volume 2024, pp. 2632–2652.

[5] Deng, J., Li, W., Zhou, J. T., and He, Y. Scope: Saliency-coverage oriented token pruning for efficient multimodel LLMs. arXiv preprint arXiv:2510.24214.

[6] Duan, S., Xue, Y., Wang, W., Su, Z., Liu, H., Yang, S., Gan, G., Wang, G., Wang, Z., Yan, S., et al. GLM-OCR technical report. arXiv preprint arXiv:2603.10910.

[7] Huang, Y., Ma, F., Shao, Y., Guo, J., Yu, Z., Cui, L., and Tian, Q. Nüwa: Mending the spatial integrity torn by VLM token pruning. arXiv preprint arXiv:2602.02951.

[8] Jiang, L., Zhang, Z., Zeng, Y., Xie, C., Liu, T., Li, Z., Cheng, L., and Xu, X. DCP: Dual-cue pruning for efficient large vision-language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 21202–21215, 2025a. Jiang, Y., Wu, Q., Lin, W., Yu, W., and Zhou, Y. What kind of visual tokens do we need? traini...

[9] Lappe, A. and Giese, M. A. Register and CLS tokens yield a decoupling of local and global features in large ViTs. arXiv preprint arXiv:2505.05892.

[10] Li, K., Chen, X., Gao, C., Li, Y., and Chen, X. Balanced token pruning: Accelerating vision language models beyond local optimization. arXiv preprint arXiv:2505.22038, 2025a. Li, Y., Yang, G., Liu, H., Wang, B., and Zhang, C. dots.ocr: Multilingual document layout parsing in a single vision-language model. arXiv preprint arXiv:2512.02498, 2025b. Liu, H...

[11] Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024a. Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024b. ...

[12] Poznanski, J., Rangapur, A., Borchardt, J., Dunkelberger, J., Huff, R., Lin, D., Wilhelm, C., Lo, K., and Soldaini, L. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv preprint arXiv:2502.18443, 2025a. Poznanski, J., Soldaini, L., and Lo, K. olmOCR 2: Unit test rewards for document OCR, 2025b. URL https://arxiv.org/abs/25...

[13] Sobel, I., Feldman, G., et al. A 3x3 isotropic gradient operator for image processing (1968). A talk at the Stanford Artificial Intelligence Project.

[14] Taghadouini, S., Cavaillès, A., and Aubertin, B. LightOnOCR: A 1B end-to-end multilingual vision-language model for state-of-the-art OCR. arXiv preprint arXiv:2601.14251.

[15] FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing. URL https://arxiv.org/abs/2605.17447. Tong, J., Jin, W., Qin, P., Li, A., Zou, Y., Li, Y., Li, Y., and Li, R. FlowCut: Rethinking redundancy via information flow for efficient vision-language models. Advances in Neural Information Processing Systems, 38:94946–94973.

[16] DeepSeek-OCR: Contexts Optical Compression. URL https://arxiv.org/abs/2510.18234. Wei, H., Sun, Y., and Li, Y. DeepSeek-OCR 2: Visual causal flow. arXiv preprint arXiv:2601.20552.

[17] Wen, Z., Gao, Y., Wang, S., Zhang, J., Zhang, Q., Li, W., He, C., and Zhang, L. Stop looking for important tokens in multimodal language models: Duplication matters more. arXiv preprint arXiv:2502.11494.

[18] Yao, L., Xing, L., Shi, Y., Li, S., Liu, Y., Dong, Y., Zhang, Y.-F., Li, L., Dong, Q., Dong, X., et al. Towards efficient multimodal large language models: A survey on token compression. TechRxiv, 2026(0112).

[19] doi: 10.36227/techrxiv.176823010.07236701/v1. URL https://www.techrxiv.org/doi/abs/10.36227/techrxiv.176823010.07236701/v1. Ye, W., Wu, Q., Lin, W., and Zhou, Y. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 22128–22136.

[20] Zhang, K., Li, B., Zhang, P., Pu, F., Cahyono, J. A., Hu, K., Liu, S., Zhang, Y., Yang, J., Li, C., et al. LMMs-Eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 881–916, 2025a. Zhang, Q., Liu, M., Li, L., Lu, M., Zhang, Y., Pan, J., She, Q., and Zhang, S. Beyo...

[21] Following visualization in CDPruner (Zhang et al., 2025b), Figure 7 visualizes the correlation between the input prompt text and the image embeddings processed by different CLIP models. Compared to CLIP-ViT-L/14-336px, CLIP-ViT-B/16 inherently exhibits weaker multi-modal alignment. This alignment is further attenuated as the CLIP model within DeepEncode...

[22] and a shared expert (intermediate dimension 1792), activating only 570M parameters per inference to balance capacity and efficiency. A core optimization lies in dynamic token adaptation: aligned with DeepEncoder's compression ratio, the decoder adjusts token counts layer-wise during decoding, reducing self-attention's $n^2$ complexity. Combined with graph-ba...

[23] Ocean-OCR is a state-of-the-art OCR benchmark tailored for evaluating advanced document understanding capabilities, including complex layout parsing, multi-modal content recognition (e.g., charts, diagrams, and handwritten text), and cross-language OCR performance. The benchmark features a diverse dataset of documents from various domains (e.g., academia...
