FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
Pith reviewed 2026-05-20 14:49 UTC · model grok-4.3
The pith
FastOCR recasts global KV cache pruning as local dynamic selection by exploiting gradual shifts in visual attention during OCR decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that pruning visual tokens from dense documents, intractable as a global problem, becomes tractable when attention is treated as a moving local fixation rather than a fixed global set. Two modules implement this: Focal-Guided Pruning selects task-relevant tokens from focal layers at each decoding step, and Cross-Step Fixation Reuse carries the prior fixation forward. Neither module permanently removes any token from the KV cache.
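A minimal sketch of selection without eviction, assuming per-token relevance scores are already available from a focal layer. The names `select_fixation` and `attend`, the toy keys and values, and the use of a plain dot product as the relevance score are illustrative assumptions, not the paper's implementation. The point is only that the cache keeps every token while attention runs over a small index set.

```python
import math

def select_fixation(scores, keep_ratio=0.05):
    """Return indices of the highest-scoring visual tokens, in positional order."""
    k = max(1, round(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

def attend(query, keys, values, active):
    """Softmax attention restricted to the token indices in `active`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in active]
    peak = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, active)) / total
            for d in range(dim)]

# Toy cache with 40 visual tokens (hypothetical data); nothing is deleted from it.
keys = [[math.sin(i), math.cos(i)] for i in range(40)]
values = [[float(i), 1.0] for i in range(40)]
query = [1.0, 0.0]

# Assumed relevance score: dot product between the query and each cached key.
scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
active = select_fixation(scores, keep_ratio=0.05)   # 2 of the 40 tokens
out = attend(query, keys, values, active)
```

Because `keys` and `values` are never modified, a later step can select a completely different index set from the same cache.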
What carries the argument
Dynamic Visual Fixation, the observed pattern in which model attention concentrates on a small shifting region of the document image across successive decoding steps instead of attending uniformly.
If this is right
- The same plug-and-play modules can be added to any of the five tested VLMs of different sizes and architectures without retraining.
- On Qwen2.5-VL, attention latency drops by 3.0× while accuracy stays at 98% of the unpruned baseline, with only 5% of the visual tokens attended per decoding step.
- Because no tokens are evicted from the cache, the approach sidesteps the irreversible information loss that defeats physical pruning on text-dense images.
- The gradual shift in fixation lets each decoding step warm-start from the previous step's fixation rather than recomputing relevance from scratch.
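One way the warm start in the last point could work is sketched here, under loud assumptions: the abstract does not say how Cross-Step Fixation Reuse builds its candidate set, so the neighborhood expansion, the `radius` parameter, the 1-D token layout, and the function names are all hypothetical. Real visual tokens sit on a 2-D grid, and the paper's procedure may differ.

```python
def warm_start_fixation(prev_active, score_fn, n_tokens, k, radius=2):
    """Pick k tokens for this step, scoring only tokens near the previous fixation."""
    candidates = set()
    for i in prev_active:
        lo, hi = max(0, i - radius), min(n_tokens - 1, i + radius)
        candidates.update(range(lo, hi + 1))
    ranked = sorted(candidates, key=score_fn, reverse=True)[:k]
    return sorted(ranked), len(candidates)

def make_score_fn(position):
    """Toy relevance: highest at the current reading position, falling off linearly."""
    return lambda i: -abs(i - position)

active = [10, 11, 12]          # fixation carried over from an earlier step
history = []
for position in [12, 13, 14]:  # the reading position advances one token per step
    active, n_scored = warm_start_fixation(
        active, make_score_fn(position), n_tokens=100, k=3)
    history.append((active, n_scored))
```

In this toy run each step scores 7 candidates instead of all 100 tokens, and the selected window follows the reading position. If the fixation jumped farther than `radius`, the candidate set would miss it, which is why the gradual-shift premise matters.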
Where Pith is reading between the lines
- The same gradual-fixation pattern might appear in other dense visual tasks such as table extraction or chart reading, suggesting the method could transfer without modification.
- If the focal layers turn out to be consistent across models, future implementations could pre-identify them once and reuse the choice for faster deployment.
- Combining this cache-side selection with existing token-compression techniques applied before the cache might produce additive speedups on very long documents.
Load-bearing premise
The model's attention on document images is temporally sparse and shifts gradually across decoding steps in a way that allows safe dynamic selection of tokens without irreversible loss of character or layout information.
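The premise can be probed with a simple diagnostic: measure how much the attended token set overlaps between consecutive decoding steps. The sketch below uses Jaccard overlap. The function names and both example traces are made up for illustration and are not measurements from the paper.

```python
def jaccard(a, b):
    """Jaccard overlap of two index collections (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def fixation_overlaps(fixations):
    """Overlap between each pair of consecutive per-step fixation sets."""
    return [jaccard(p, q) for p, q in zip(fixations, fixations[1:])]

# Hypothetical traces: one drifts slowly, one jumps around the page.
gradual = [[10, 11, 12], [11, 12, 13], [12, 13, 14]]
jumpy = [[10, 11, 12], [60, 61, 62], [5, 6, 7]]

gradual_overlaps = fixation_overlaps(gradual)
jumpy_overlaps = fixation_overlaps(jumpy)
```

Consistently high overlap on real attention traces would support the premise. Overlap near zero on some layout type would mark a case where reusing the previous fixation is unsafe.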
What would settle it
Run the same OCR benchmarks with the dynamic selection replaced by random choice of the same fraction of tokens at each step; if accuracy collapses well below the reported retention level, the gradual-fixation premise is required for the method to work.
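The random-choice control described above could be set up as follows. This is a sketch only: `random_fixation` is a hypothetical helper, and plugging it into a model's attention path is left out because that depends on the model.

```python
import random

def random_fixation(n_tokens, k, rng):
    """Pick k distinct token indices uniformly at random, in positional order."""
    return sorted(rng.sample(range(n_tokens), k))

rng = random.Random(0)          # fixed seed so the control run is repeatable
n_tokens, k = 1000, 50          # 5% budget, matching the size of the learned selection
control = [random_fixation(n_tokens, k, rng) for _ in range(3)]  # one set per step
```

Drawing a fresh set at every step keeps the token budget identical to the dynamic method while removing any dependence on attention.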
Original abstract
Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0×.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FastOCR, a training-free plug-and-play framework for accelerating document parsing in VLMs. It observes that attention over visual tokens in dense document images is temporally sparse and shifts gradually across decoding steps (analogous to human fixation). The method recasts pruning as a local dynamic problem via two modules: Focal-Guided Pruning, which selects task-relevant tokens from a small set of focal layers at each step, and Cross-Step Fixation Reuse, which warm-starts the current step from the prior fixation. By adjusting attention rather than permanently evicting tokens from the KV cache, the approach avoids irreversible information loss. Experiments claim that on Qwen2.5-VL the method retains 98% of unpruned accuracy while attending to only 5% of visual tokens per decoding step, yielding a 3.0× reduction in attention latency, and generalizes consistently across five VLMs of varying sizes and architectures.
Significance. If the dynamic visual fixation assumption holds across document types, FastOCR would provide a practical, low-overhead acceleration technique for high-token-count OCR tasks that sidesteps the accuracy collapse typical of permanent-eviction pruning methods. The training-free and cache-preserving design is a clear engineering strength, and the reported 3× latency gain at near-full accuracy would be impactful for deployment of VLMs on document understanding workloads.
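Attending to 5% of the visual tokens does not by itself imply a 20× gain. A back-of-the-envelope model shows why a smaller measured figure is plausible. The model and every input value below are assumptions for illustration: the source does not report how attention time splits between visual and other tokens, nor the cost of the selection step.

```python
def attention_speedup(visual_share, keep_ratio, overhead=0.0):
    """Idealized attention speedup when only part of the visual-token work remains.

    visual_share: fraction of baseline attention time spent on visual tokens.
    keep_ratio:   fraction of visual tokens still attended at each step.
    overhead:     extra selection cost, as a fraction of baseline attention time.
    """
    remaining = (1.0 - visual_share) + visual_share * keep_ratio + overhead
    return 1.0 / remaining

# Upper bound: all attention time is visual and selection is free.
ideal = attention_speedup(visual_share=1.0, keep_ratio=0.05)

# Hypothetical split: 90% visual, with selection costing 10% of baseline time.
with_overhead = attention_speedup(visual_share=0.9, keep_ratio=0.05, overhead=0.1)
```

Under these made-up inputs the bound of about 20× falls to roughly 4×, so non-visual tokens and selection cost can dominate once the visual work is small.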
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The central quantitative claim (98% accuracy retention at 5% tokens, 3.0× latency reduction on Qwen2.5-VL) is presented without any baseline comparisons, variance estimates, or statistical significance tests. This information is load-bearing for judging whether the reported gains are robust or merely within noise.
- [§3.1] §3.1 (Focal-Guided Pruning): The procedure for identifying the small set of focal layers and for choosing the per-step pruning threshold is described at a high level but lacks an ablation or sensitivity analysis. Because the entire speedup rests on these choices, the manuscript must demonstrate that performance is stable under reasonable variation of these hyperparameters.
- [§3 and §5] §3 and §5 (Dynamic Visual Fixation assumption): The method’s safety claim (that gradual fixation shift permits safe dynamic selection without irreversible loss of character or layout information) depends on attention being temporally sparse and locally shifting. No experiments on multi-column pages, tables with cross-references, or figures are reported; such cases could violate the gradual-shift premise and are therefore load-bearing for the central claim.
minor comments (3)
- [§3.2] Clarify in §3.2 how Cross-Step Fixation Reuse interacts with the KV cache when the fixation region moves; a small diagram or pseudocode would remove ambiguity.
- [Results table] Table 1 (or equivalent results table): report both mean and standard deviation over multiple document samples rather than single-point estimates.
- [Conclusion] Add a short paragraph in the conclusion or limitations section explicitly listing document layouts on which the method has not yet been tested.
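The reporting requested in the comments on variance and significance can be sketched with the standard library. The scores below are placeholder numbers invented for the example, not results from the paper. `paired_t` returns only the t statistic and degrees of freedom, so a p-value would still need a t-distribution table or a statistics package.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation of per-document scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t(baseline, treated):
    """Paired t statistic and degrees of freedom for matched samples.

    Assumes the paired differences are not all identical (nonzero spread).
    """
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err, n - 1

# Placeholder per-document accuracies for the same five documents.
unpruned = [0.90, 0.88, 0.92, 0.91, 0.89]
pruned = [0.89, 0.86, 0.91, 0.90, 0.87]

mean_pruned, std_pruned = summarize(pruned)
t_stat, dof = paired_t(unpruned, pruned)
```

Pairing by document matters here because document difficulty varies far more than the pruning effect, and an unpaired comparison would bury the difference in that variation.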
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below and will revise the manuscript to incorporate additional analyses and experiments where needed to strengthen the claims.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): The central quantitative claim (98% accuracy retention at 5% tokens, 3.0× latency reduction on Qwen2.5-VL) is presented without any baseline comparisons, variance estimates, or statistical significance tests. This information is load-bearing for judging whether the reported gains are robust or merely within noise.
  Authors: We agree that baseline comparisons, variance estimates, and statistical significance tests would strengthen the presentation of our results. In the revised manuscript we will add comparisons to representative static and dynamic pruning baselines from the VLM literature, report mean accuracy and latency with standard deviations over at least five independent evaluation runs on the test sets, and include paired statistical tests to establish that the observed differences are significant.
  Revision: yes
- Referee: [§3.1] §3.1 (Focal-Guided Pruning): The procedure for identifying the small set of focal layers and for choosing the per-step pruning threshold is described at a high level but lacks an ablation or sensitivity analysis. Because the entire speedup rests on these choices, the manuscript must demonstrate that performance is stable under reasonable variation of these hyperparameters.
  Authors: We acknowledge the importance of demonstrating robustness to these design choices. The revised version will include a dedicated sensitivity study in §3.1 that varies the number of focal layers (1–8) and the pruning ratio (top-1% to top-10%), showing that accuracy retention stays above 95% and latency gains remain consistent across the tested range.
  Revision: yes
- Referee: [§3 and §5] §3 and §5 (Dynamic Visual Fixation assumption): The method’s safety claim (that gradual fixation shift permits safe dynamic selection without irreversible loss of character or layout information) depends on attention being temporally sparse and locally shifting. No experiments on multi-column pages, tables with cross-references, or figures are reported; such cases could violate the gradual-shift premise and are therefore load-bearing for the central claim.
  Authors: We recognize that explicit validation on complex layouts is necessary to support the core assumption. While our current benchmarks contain diverse documents, we did not isolate multi-column pages, cross-referenced tables, or figures. In the revision we will add a targeted evaluation subsection using suitable examples from these categories and will report accuracy, token usage, and any observed deviations from the gradual-shift behavior, together with a discussion of limitations in §5.
  Revision: yes
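The sensitivity study promised in the second response amounts to a grid sweep over the two hyperparameters. A minimal harness might look like the sketch below. The ranges follow the rebuttal (1 to 8 focal layers, top-1% to top-10%), and the 95% floor is the rebuttal's stated target. `stand_in_evaluate` is a fake stub with invented return values, so that the example runs. A real study would replace it with an actual benchmark run.

```python
from itertools import product

def sensitivity_grid(evaluate, layer_counts, keep_ratios):
    """Evaluate every (focal layer count, keep ratio) pair."""
    return {(n, r): evaluate(n, r) for n, r in product(layer_counts, keep_ratios)}

def unstable_cells(grid, floor):
    """Grid cells whose accuracy retention falls below `floor`."""
    return sorted(cell for cell, retention in grid.items() if retention < floor)

def stand_in_evaluate(n_layers, keep_ratio):
    """Fake retention values for demonstration only; not measured results."""
    return 0.99 if keep_ratio >= 0.05 else 0.93

layer_counts = range(1, 9)               # 1 to 8 focal layers
keep_ratios = [0.01, 0.02, 0.05, 0.10]   # top-1% to top-10%

grid = sensitivity_grid(stand_in_evaluate, layer_counts, keep_ratios)
failures = unstable_cells(grid, floor=0.95)
```

Listing the failing cells, rather than reporting a single average, shows directly where in the hyperparameter range the method stops being safe.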
Circularity Check
No circularity: empirical framework with independent experimental validation
The paper introduces FastOCR as a training-free plug-and-play module based on an empirical observation of temporally sparse attention in VLMs processing documents. The reported performance numbers (98% accuracy retention at 5% tokens, 3.0× latency reduction on Qwen2.5-VL) are obtained from direct experiments across five models rather than from any closed-form derivation, fitted parameters, or self-citation chain that reduces the claims to inputs by construction. No equations, uniqueness theorems, or ansatzes are presented that equate the pruning strategy or accuracy metrics to quantities defined within the paper itself; the method's correctness is externally falsifiable via standard OCR benchmarks and remains self-contained against those benchmarks.
Forward citations
Cited by 3 Pith papers
- RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
  RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
- RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
  RTPrune introduces a reading-twice inspired two-stage pruning technique for DeepSeek-OCR that retains 84.25% tokens while delivering 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
- RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
  RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.