RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
Pith reviewed 2026-05-22 10:23 UTC · model grok-4.3
The pith
DeepSeek-OCR can prune visual tokens according to its own two-stage reading pattern to run faster while retaining high accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-OCR exhibits a distinct two-stage reading trajectory during decoding in which it initially prioritizes the majority of high-norm tokens and subsequently redistributes its attention to the remaining ones. RTPrune follows this trajectory by retaining high-norm visual tokens that hold salient textual and structural information in the first stage, then pairing and merging the leftover tokens through optimal transport theory for efficient feature aggregation in the second stage. The approach adds a dynamic pruning ratio that adapts to token similarity and textual density, producing faster inference while preserving the textual fidelity required for accurate OCR.
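The first stage described above can be illustrated with a minimal sketch. It assumes visual tokens are plain lists of floats and that "high-norm" means largest L2 norm. The names `split_by_norm` and `keep_ratio` are hypothetical and not taken from the paper, and the second-stage merging is not shown.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one token embedding."""
    return math.sqrt(sum(x * x for x in vec))


def split_by_norm(tokens, keep_ratio):
    """Split token indices into (kept, leftover) by descending L2 norm.

    keep_ratio is a hypothetical knob: the fraction of tokens retained
    unchanged in stage one. Leftover tokens would go on to stage two.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(keep_ratio * len(tokens)))
    # Rank indices from largest to smallest norm.
    order = sorted(range(len(tokens)),
                   key=lambda i: l2_norm(tokens[i]),
                   reverse=True)
    # Restore positional order so spatial layout is not scrambled.
    kept = sorted(order[:n_keep])
    leftover = sorted(order[n_keep:])
    return kept, leftover
```

Returning indices in positional order, rather than norm order, is a design choice of this sketch so that any downstream step still sees tokens in their original reading order.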
What carries the argument
The two-stage reading trajectory that first selects high-norm tokens and then merges the rest via optimal transport theory.
If this is right
- Achieves 99.47 percent accuracy on OmniDocBench while retaining only 84.25 percent of the original tokens.
- Produces 1.23 times faster prefill inference when used with DeepSeek-OCR-Large.
- Outperforms prior token-pruning methods for vision-language models by better preserving textual fidelity.
- Improves the efficiency-accuracy trade-off through a dynamic ratio that responds to textual density.
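As a rough plausibility check on the retention and speedup figures above, the sketch below compares the reported speedup with two idealized cost models. Both models are simplifying assumptions made here, not claims from the paper: they ignore the cost of the pruning step itself and any tokens that are never pruned.

```python
# Back-of-envelope check; both cost models are simplifying assumptions.
retention = 0.8425        # fraction of tokens kept (from the summary)
reported_speedup = 1.23   # prefill speedup (from the summary)

# If prefill cost scaled linearly with token count:
linear_speedup = 1.0 / retention
# If prefill cost scaled quadratically with token count (attention-dominated):
quadratic_speedup = 1.0 / retention ** 2

print(f"linear model:    {linear_speedup:.3f}x")
print(f"quadratic model: {quadratic_speedup:.3f}x")
print(f"reported:        {reported_speedup:.2f}x")
```

Under these assumptions the reported 1.23× falls between the linear estimate (about 1.19×) and the quadratic estimate (about 1.41×), which is what one would expect if prefill cost mixes linear and quadratic terms.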
Where Pith is reading between the lines
- The same two-stage attention shift may exist in other large vision-language models, allowing similar pruning without model-specific retraining.
- Pairing RTPrune with additional compression steps such as quantization could further reduce memory use on long documents.
- The dynamic ratio based on similarity and density could be tested on streaming OCR tasks where text density changes rapidly.
Load-bearing premise
The two-stage reading trajectory observed in DeepSeek-OCR supplies a reliable signal for deciding which tokens can be pruned or merged without losing the textual information needed for accurate OCR.
What would settle it
Applying RTPrune to a new set of documents and measuring whether the character error rate rises above 5 percent while the reported speed gains are still obtained.
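The test above depends on how character error rate is computed. A minimal sketch follows, assuming the common definition of CER as Levenshtein edit distance divided by reference length; the summary does not say which definition or normalization the paper uses, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def exceeds_threshold(ref, hyp, threshold=0.05):
    """True if CER is above the 5 percent level named in the text."""
    return character_error_rate(ref, hyp) > threshold
```

Comparing `character_error_rate` for pruned and unpruned outputs against the same ground truth would show whether pruning, rather than the base model, is responsible for any rise.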
Original abstract
DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23× faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.
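The abstract does not give the formula for the dynamic pruning ratio. The sketch below is one hypothetical parameterization, included only to make the idea concrete: the keep ratio rises with textual density and falls with token redundancy. The function names, the linear form, the weights, and the clamping range are all assumptions made here, not the paper's method.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all unordered token pairs."""
    pairs = [(i, j) for i in range(len(tokens))
             for j in range(i + 1, len(tokens))]
    if not pairs:
        return 0.0
    return sum(cosine(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)


def dynamic_keep_ratio(tokens, text_density, base=0.5,
                       w_density=0.4, w_similarity=0.3, lo=0.1, hi=1.0):
    """Hypothetical rule: keep more tokens on dense text, fewer when redundant.

    text_density is assumed to be a score in [0, 1] computed elsewhere.
    All weights are illustrative placeholders, not values from the paper.
    """
    redundancy = max(0.0, mean_pairwise_cosine(tokens))
    raw = base + w_density * text_density - w_similarity * redundancy
    return min(hi, max(lo, raw))
```

Clamping to `[lo, hi]` keeps the ratio usable even when the two signals disagree strongly; any real implementation would need these constants tuned against OCR accuracy.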
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RTPrune, a two-stage token pruning method for DeepSeek-OCR motivated by an observed two-stage reading trajectory in the model's decoding process. In stage one, high-norm visual tokens are prioritized; in stage two, remaining tokens are paired and merged via optimal transport. A dynamic pruning ratio adapts to token similarity and textual density. On OmniDocBench with DeepSeek-OCR-Large, the method reports 99.47% accuracy at 84.25% token retention and 1.23× faster prefill.
Significance. If the two-stage trajectory observation and OT merging are shown to preserve OCR-critical features, the work could offer a practical efficiency gain for visual-text compression in OCR VLMs. The dynamic ratio and transport-based aggregation are distinctive elements that address redundancy in long-text OCR settings.
major comments (3)
- [§3.1–3.2] The central claim rests on the two-stage trajectory (initial high-norm prioritization followed by redistribution), yet no attention-norm histograms, per-layer norm statistics, or stage-transition thresholds are provided to quantify the observation or justify the pruning decisions.
- [§3.3] The optimal-transport merging step is load-bearing for fidelity at 84.25% retention, but pair-selection criteria, cost function definition, and any verification that merged features retain character-level or structural information are absent; without these the 99.47% accuracy claim cannot be evaluated.
- [§4 / Table 2] The reported 99.47% accuracy and 1.23× prefill speedup are presented without baselines, ablations on the dynamic ratio, error bars, or cross-layout robustness tests; this directly undermines the state-of-the-art claim given the abstract's limited experimental detail.
minor comments (2)
- [§3.3] Notation for the dynamic pruning ratio and transport cost should be defined explicitly with symbols before first use.
- [Figures] Figure captions for any attention or merging visualizations should state the exact document subset and layer indices shown.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and rigor where appropriate.
Point-by-point responses
- Referee: [§3.1–3.2] The central claim rests on the two-stage trajectory (initial high-norm prioritization followed by redistribution), yet no attention-norm histograms, per-layer norm statistics, or stage-transition thresholds are provided to quantify the observation or justify the pruning decisions.
Authors: We agree that explicit quantitative support strengthens the motivation. Our analysis of DeepSeek-OCR's decoding process identified the two-stage pattern via attention norm monitoring, but supporting visualizations were omitted from the initial submission. We have added attention-norm histograms, per-layer norm statistics, and stage-transition thresholds to Sections 3.1–3.2 in the revision.
Revision: yes
- Referee: [§3.3] The optimal-transport merging step is load-bearing for fidelity at 84.25% retention, but pair-selection criteria, cost function definition, and any verification that merged features retain character-level or structural information are absent; without these the 99.47% accuracy claim cannot be evaluated.
Authors: We acknowledge the need for fuller specification of the OT step. Pair selection uses cosine similarity on normalized visual embeddings, and the cost function is the squared Euclidean distance for the transport plan. We have expanded Section 3.3 with the complete formulation and added verification experiments demonstrating preserved character-level accuracy and layout structure.
Revision: yes
- Referee: [§4 / Table 2] The reported 99.47% accuracy and 1.23× prefill speedup are presented without baselines, ablations on the dynamic ratio, error bars, or cross-layout robustness tests; this directly undermines the state-of-the-art claim given the abstract's limited experimental detail.
Authors: Table 2 already contains comparisons against prior token-pruning baselines. To further address the comment, we have added ablations on the dynamic pruning ratio, error bars from repeated runs, and cross-layout robustness tests on varied OmniDocBench documents. These updates appear in the revised Section 4 and Table 2.
Revision: partial
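The rebuttal's description of the OT step (cosine similarity on normalized embeddings, squared Euclidean transport cost) can be sketched as follows. For unit-normalized vectors the two measures agree, since the squared Euclidean distance equals 2 − 2·cos. Everything else here is an assumption made for illustration: the entropic-regularized Sinkhorn solver, the uniform marginals, the split of leftover tokens into `sources` and `targets`, and the equal-weight averaging in `ot_merge` are not confirmed by the source.

```python
import math


def sq_euclidean(a, b):
    """Squared Euclidean distance, the transport cost named in the rebuttal."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic-regularized OT plan with uniform marginals (Sinkhorn iteration).

    The solver choice and reg value are assumptions of this sketch.
    """
    n, m = len(cost), len(cost[0])
    row_mass, col_mass = 1.0 / n, 1.0 / m
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_merge(sources, targets, reg=0.1):
    """Fold source tokens into targets via the plan's barycentric projection.

    Inputs are assumed to be unit-normalized. Each merged token is the
    average of a target and the plan-weighted mean of the sources sent to it
    (the 50/50 weighting is a hypothetical choice).
    """
    cost = [[sq_euclidean(s, t) for t in targets] for s in sources]
    plan = sinkhorn_plan(cost, reg)
    dim = len(targets[0])
    merged = []
    for j, t in enumerate(targets):
        col = sum(plan[i][j] for i in range(len(sources)))
        bary = [sum(plan[i][j] * sources[i][d] for i in range(len(sources))) / col
                for d in range(dim)]
        merged.append([(t[d] + bary[d]) / 2.0 for d in range(dim)])
    return merged
```

With a small `reg`, the plan concentrates mass on the nearest source-target pairs, so merging behaves almost like hard pairing; larger values spread each source across several targets. Very small `reg` combined with large costs can underflow the kernel, which a log-domain solver would avoid.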
Circularity Check
No significant circularity; derivation is empirically motivated and self-contained
full rationale
The paper derives RTPrune from an empirical analysis of DeepSeek-OCR's decoding process revealing a two-stage trajectory (high-norm token prioritization followed by attention redistribution to remaining tokens). This observation directly motivates the first-stage high-norm pruning and second-stage optimal-transport merging without any step reducing to self-definition, fitted inputs called predictions, or load-bearing self-citations. The dynamic pruning ratio adapts to similarity and density as an extension of the insight, and accuracy claims are tied to external experiments on OmniDocBench rather than being tautological with the trajectory description. No equations or uniqueness theorems are shown to collapse the method onto its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- dynamic pruning ratio
axioms (1)
- domain assumption: Optimal transport theory provides an effective way to pair and merge remaining visual tokens while preserving feature information.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
FastOCR dynamically selects a small subset of visual tokens per decoding step using focal-guided pruning and cross-step reuse, retaining 98% accuracy on Qwen2.5-VL while attending to only 5% of tokens and cutting atte...