arxiv: 2605.17447 · v1 · pith:6JIUDJHVnew · submitted 2026-05-17 · 💻 cs.CV · cs.CL

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

Pith reviewed 2026-05-20 14:49 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords FastOCR · KV cache pruning · vision-language models · OCR · document parsing · dynamic visual fixation · inference acceleration · training-free pruning

The pith

FastOCR recasts global KV cache pruning as local dynamic selection by exploiting gradual shifts in visual attention during OCR decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that although document images pack many visual tokens, vision-language models attend to them in a temporally sparse pattern that moves gradually from one decoding step to the next, much like a human eye moving across a page. Existing methods permanently discard tokens during the prefill stage, which destroys too much character and layout information on dense text. FastOCR avoids eviction and instead adjusts which tokens the model attends to at each step. Two modules do the work: Focal-Guided Pruning picks the most relevant tokens from a few focal layers, and Cross-Step Fixation Reuse warm-starts each step from the prior step's selection. Because nothing is removed from the cache, accuracy stays close to the original model while far fewer tokens are attended per step.

Core claim

The central claim is that pruning visual tokens from dense documents, intractable as a fixed global selection, becomes tractable when attention is treated as a moving local fixation. Two modules implement this: Focal-Guided Pruning selects task-relevant tokens from focal layers at each step, and Cross-Step Fixation Reuse carries the prior fixation forward. Neither module permanently removes any token from the KV cache.
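
The distinction between selecting tokens and evicting them can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `keep_ratio` parameter, and the toy `focal_scores` values are assumptions made for illustration. It only shows that attention can be restricted to a per-step subset while the cache itself keeps every entry.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_fixation(focal_scores, keep_ratio):
    """Indices of the highest-scoring visual tokens, in cache order."""
    n = len(focal_scores)
    k = max(1, math.ceil(keep_ratio * n))
    ranked = sorted(range(n), key=lambda i: focal_scores[i], reverse=True)
    return sorted(ranked[:k])

def attend(query, keys, values, active):
    """Scaled dot-product attention restricted to the `active` cache slots."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in active]
    weights = softmax(logits)
    width = len(values[0])
    return [sum(w * values[i][j] for w, i in zip(weights, active)) for j in range(width)]

# Toy cache with four visual tokens; nothing is ever deleted from it.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0], [2.0], [3.0], [4.0]]
focal_scores = [0.1, 0.9, 0.3, 0.8]  # hypothetical focal-layer relevance scores

active = select_fixation(focal_scores, keep_ratio=0.5)  # subset used at this step only
output = attend([0.0, 1.0], keys, values, active)
```

Because `keys` and `values` are left untouched, a token skipped at one step remains available for selection at any later step, which is the property the core claim depends on.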

What carries the argument

Dynamic Visual Fixation, the observed pattern in which model attention concentrates on a small shifting region of the document image across successive decoding steps instead of attending uniformly.

If this is right

  • The same plug-and-play modules can be added to any of the five tested VLMs of different sizes and architectures without retraining.
  • On Qwen2.5-VL, attention latency drops by a factor of 3.0 while accuracy stays at 98 percent of the unpruned baseline, with only 5 percent of the visual tokens attended per decoding step.
  • Because no tokens are evicted from the cache, the approach sidesteps the irreversible information loss that defeats physical pruning on text-dense images.
  • The gradual shift in fixation lets each decoding step warm-start its token selection from the previous step's fixation rather than recomputing relevance from scratch.
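
One way to picture the warm start is as a candidate set grown around the previous fixation, so that only nearby tokens are re-scored. The sketch below is an assumption about how such reuse could work, not the procedure in the paper: `radius`, `score_fn`, and the one-dimensional neighbourhood over flattened token indices are all hypothetical simplifications (a real document image would use a two-dimensional patch grid).

```python
def warm_start_candidates(prev_fixation, num_tokens, radius):
    """Expand the previous fixation by `radius` positions on each side."""
    candidates = set()
    for idx in prev_fixation:
        low = max(0, idx - radius)
        high = min(num_tokens - 1, idx + radius)
        candidates.update(range(low, high + 1))
    return sorted(candidates)

def reuse_fixation(prev_fixation, score_fn, num_tokens, k, radius=1):
    """Re-score only the warm-start candidates and keep the best k of them."""
    candidates = warm_start_candidates(prev_fixation, num_tokens, radius)
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return sorted(ranked[:k])

# Hypothetical relevance that peaks at token 5, i.e. the focus has drifted right.
def score_fn(i):
    return -abs(i - 5)

prev_fixation = [3, 4]
next_fixation = reuse_fixation(prev_fixation, score_fn, num_tokens=10, k=2)
```

In this toy run only four of the ten tokens are scored, and the fixation moves from `[3, 4]` to `[4, 5]`. The sketch works only while the focus moves by less than `radius` per step, which is the gradual-shift premise.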

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gradual-fixation pattern might appear in other dense visual tasks such as table extraction or chart reading, suggesting the method could transfer without modification.
  • If the focal layers turn out to be consistent across models, future implementations could pre-identify them once and reuse the choice for faster deployment.
  • Combining this cache-side selection with existing token-compression techniques applied before the cache might produce additive speedups on very long documents.

Load-bearing premise

The model's attention on document images is temporally sparse and shifts gradually across decoding steps in a way that allows safe dynamic selection of tokens without irreversible loss of character or layout information.
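
Both halves of this premise can be stated as measurable quantities. The sketch below assumes per-step attention weights over visual tokens are available as plain lists; the function names and toy weights are illustrative and do not come from the paper. Sparsity is the share of attention mass captured by the top-k tokens, and gradual shift is the Jaccard overlap between the top-k sets of consecutive steps.

```python
def top_k_indices(weights, k):
    """Set of the k positions with the largest attention weight."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(ranked[:k])

def captured_mass(weights, k):
    """Fraction of total attention mass that falls on the top-k tokens."""
    chosen = top_k_indices(weights, k)
    return sum(weights[i] for i in chosen) / sum(weights)

def step_overlap(prev_weights, curr_weights, k):
    """Jaccard overlap between the top-k sets of two consecutive steps."""
    a = top_k_indices(prev_weights, k)
    b = top_k_indices(curr_weights, k)
    return len(a & b) / len(a | b)

# Toy attention over six visual tokens at steps t and t+1 (made-up values).
step_t = [0.0, 0.5, 0.4, 0.1, 0.0, 0.0]
step_t1 = [0.0, 0.1, 0.5, 0.4, 0.0, 0.0]

mass = captured_mass(step_t, k=2)
overlap = step_overlap(step_t, step_t1, k=2)
```

High captured mass at small k would indicate temporal sparsity, and consistently non-zero overlap between adjacent steps would indicate that the fixation drifts rather than jumps.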

What would settle it

Run the same OCR benchmarks with the dynamic selection replaced by random choice of the same fraction of tokens at each step; if accuracy collapses well below the reported retention level, the gradual-fixation premise is required for the method to work.
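
A matched-budget random control could be set up along these lines. This is a sketch with assumed names and made-up weights. It compares attention mass only as a cheap proxy; the decisive measurement described above is benchmark accuracy with the random selector swapped into decoding.

```python
import random

def top_k(weights, k):
    """Guided selection: the k tokens with the largest weight."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])

def random_k(num_tokens, k, rng):
    """Control selection: k tokens drawn uniformly without replacement."""
    return sorted(rng.sample(range(num_tokens), k))

def attention_mass(weights, indices):
    """Share of total attention mass covered by the chosen indices."""
    return sum(weights[i] for i in indices) / sum(weights)

rng = random.Random(0)  # fixed seed so the control is reproducible
weights = [0.01, 0.02, 0.60, 0.30, 0.03, 0.04]  # hypothetical per-token attention
k = 2  # same token budget for both selectors

guided = top_k(weights, k)
control = random_k(len(weights), k, rng)
guided_mass = attention_mass(weights, guided)
control_mass = attention_mass(weights, control)
```

Keeping `k` identical for both selectors ensures that any gap between them comes from which tokens are chosen rather than how many.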

Figures

Figures reproduced from arXiv: 2605.17447 by Ao Wang, Ben Wan, Guiguang Ding, Hui Chen, Ke Zhang, Leqi Shen, Sicheng Zhao, Tongxuan Liu, Yan Feng, Zihan Tang.

Figure 1: Comparison of FastOCR with existing KV cache pruning methods. FastOCR dynamically ...
Figure 2: Overview of the FastOCR framework. Focal-Guided Pruning (FGP) consists of two sub...
Figure 3: Image attention distribution across layers. (a) Mean image attention ratio for focal vs. ...
Figure 4: Sensitivity of the OmniDocBench Overall score to FastOCR’s four hyperparameters on ...
Figure 5: Visualization of dynamic visual fixation across four consecutive decoding steps ...
Figure 6: Distribution of focal layers across all 1355 samples in OmniDocBench on Qwen2.5-VL ...

Original abstract

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0×.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces FastOCR, a training-free plug-and-play framework for accelerating document parsing in VLMs. It observes that attention over visual tokens in dense document images is temporally sparse and shifts gradually across decoding steps (analogous to human fixation). The method recasts pruning as a local dynamic problem via two modules: Focal-Guided Pruning, which selects task-relevant tokens from a small set of focal layers at each step, and Cross-Step Fixation Reuse, which warm-starts the current step from the prior fixation. By adjusting attention rather than permanently evicting tokens from the KV cache, the approach avoids irreversible information loss. Experiments claim that on Qwen2.5-VL the method retains 98% of unpruned accuracy while attending to only 5% of visual tokens per decoding step, yielding a 3.0× reduction in attention latency, and generalizes consistently across five VLMs of varying sizes and architectures.

Significance. If the dynamic visual fixation assumption holds across document types, FastOCR would provide a practical, low-overhead acceleration technique for high-token-count OCR tasks that sidesteps the accuracy collapse typical of permanent-eviction pruning methods. The training-free and cache-preserving design is a clear engineering strength, and the reported 3× latency gain at near-full accuracy would be impactful for deployment of VLMs on document understanding workloads.

major comments (3)
  1. [Abstract and §4 (Experiments)] The central quantitative claim (98% accuracy retention at 5% tokens, 3.0× latency reduction on Qwen2.5-VL) is presented without any baseline comparisons, variance estimates, or statistical significance tests. This information is load-bearing for judging whether the reported gains are robust or merely within noise.
  2. [§3.1 (Focal-Guided Pruning)] The procedure for identifying the small set of focal layers and for choosing the per-step pruning threshold is described at a high level but lacks an ablation or sensitivity analysis. Because the entire speedup rests on these choices, the manuscript must demonstrate that performance is stable under reasonable variation of these hyperparameters.
  3. [§3 and §5 (Dynamic Visual Fixation assumption)] The method’s safety claim, that gradual fixation shift permits safe dynamic selection without irreversible loss of character or layout information, depends on attention being temporally sparse and locally shifting. No experiments on multi-column pages, tables with cross-references, or figures are reported; such cases could violate the gradual-shift premise and are therefore load-bearing for the central claim.
minor comments (3)
  1. [§3.2] Clarify in §3.2 how Cross-Step Fixation Reuse interacts with the KV cache when the fixation region moves; a small diagram or pseudocode would remove ambiguity.
  2. [Results table] Table 1 (or equivalent results table): report both mean and standard deviation over multiple document samples rather than single-point estimates.
  3. [Conclusion] Add a short paragraph in the conclusion or limitations section explicitly listing document layouts on which the method has not yet been tested.
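
The reporting the referee asks for (mean, standard deviation, and a paired comparison against the unpruned model) needs only the standard library. The sketch below uses made-up placeholder scores, not results from the paper, and the function names are illustrative.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation of per-document scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(baseline, pruned):
    """t statistic of the per-document differences (baseline minus pruned)."""
    diffs = [b - p for b, p in zip(baseline, pruned)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))

# Placeholder scores for the same three documents under both settings.
baseline_scores = [2.0, 4.0, 6.0]
pruned_scores = [1.0, 2.0, 3.0]

mean_score, std_score = summarize(pruned_scores)
t_value = paired_t_statistic(baseline_scores, pruned_scores)
```

The comparison is paired because both settings are evaluated on the same documents, so per-document difficulty cancels out in the differences. Turning the t statistic into a p-value requires the t distribution with n − 1 degrees of freedom, which this sketch leaves out.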

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below and will revise the manuscript to incorporate additional analyses and experiments where needed to strengthen the claims.

  1. Referee: [Abstract and §4 (Experiments)] The central quantitative claim (98% accuracy retention at 5% tokens, 3.0× latency reduction on Qwen2.5-VL) is presented without any baseline comparisons, variance estimates, or statistical significance tests. This information is load-bearing for judging whether the reported gains are robust or merely within noise.

    Authors: We agree that baseline comparisons, variance estimates, and statistical significance tests would strengthen the presentation of our results. In the revised manuscript we will add comparisons to representative static and dynamic pruning baselines from the VLM literature, report mean accuracy and latency with standard deviations over at least five independent evaluation runs on the test sets, and include paired statistical tests to establish that the observed differences are significant.

    revision: yes

  2. Referee: [§3.1 (Focal-Guided Pruning)] The procedure for identifying the small set of focal layers and for choosing the per-step pruning threshold is described at a high level but lacks an ablation or sensitivity analysis. Because the entire speedup rests on these choices, the manuscript must demonstrate that performance is stable under reasonable variation of these hyperparameters.

    Authors: We acknowledge the importance of demonstrating robustness to these design choices. The revised version will include a dedicated sensitivity study in §3.1 that varies the number of focal layers (1–8) and the pruning ratio (top-1% to top-10%), showing that accuracy retention stays above 95% and latency gains remain consistent across the tested range.

    revision: yes

  3. Referee: [§3 and §5 (Dynamic Visual Fixation assumption)] The method’s safety claim, that gradual fixation shift permits safe dynamic selection without irreversible loss of character or layout information, depends on attention being temporally sparse and locally shifting. No experiments on multi-column pages, tables with cross-references, or figures are reported; such cases could violate the gradual-shift premise and are therefore load-bearing for the central claim.

    Authors: We recognize that explicit validation on complex layouts is necessary to support the core assumption. While our current benchmarks contain diverse documents, we did not isolate multi-column pages, cross-referenced tables, or figures. In the revision we will add a targeted evaluation subsection using suitable examples from these categories and will report accuracy, token usage, and any observed deviations from the gradual-shift behavior, together with a discussion of limitations in §5.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

The paper introduces FastOCR as a training-free plug-and-play module based on an empirical observation of temporally sparse attention in VLMs processing documents. The reported performance numbers (98% accuracy retention at 5% tokens, 3.0× latency reduction on Qwen2.5-VL) are obtained from direct experiments across five models rather than from any closed-form derivation, fitted parameters, or self-citation chain that reduces the claims to inputs by construction. No equations, uniqueness theorems, or ansatzes are presented that equate the pruning strategy or accuracy metrics to quantities defined within the paper itself; the method's correctness is externally falsifiable via standard OCR benchmarks and remains self-contained against those benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on an empirical observation of temporally sparse attention rather than new mathematical axioms or invented physical entities; no free parameters are explicitly named in the abstract.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

    cs.CV 2026-05 unverdicted novelty 6.0

    RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.

  2. RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

    cs.CV 2026-05 unverdicted novelty 5.0

    RTPrune introduces a reading-twice inspired two-stage pruning technique for DeepSeek-OCR that retains 84.25% tokens while delivering 99.47% accuracy and 1.23x faster prefill on OmniDocBench.

  3. RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

    cs.CV 2026-05 unverdicted novelty 4.0

    RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
