FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

Ao Wang; Ben Wan; Guiguang Ding; Hui Chen; Ke Zhang; Leqi Shen; Sicheng Zhao; Tongxuan Liu; Yan Feng; Zihan Tang

arxiv: 2605.17447 · v1 · pith:6JIUDJHVnew · submitted 2026-05-17 · 💻 cs.CV · cs.CL

Pith reviewed 2026-05-20 14:49 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords FastOCR · KV cache pruning · vision-language models · OCR · document parsing · dynamic visual fixation · inference acceleration · training-free pruning

The pith

FastOCR recasts global KV cache pruning as local dynamic selection by exploiting gradual shifts in visual attention during OCR decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that although document images pack many visual tokens, vision-language models attend to them in a temporally sparse pattern that moves gradually from one decoding step to the next, much like a human eye moving across a page. Existing methods that permanently discard tokens during the initial pass destroy too much character and layout information on dense text, but FastOCR avoids eviction by adjusting only which tokens the model looks at in each step. Two modules do the work: one picks the most relevant tokens from a few focal layers, and the other reuses the prior step's selection to warm-start the next. Because nothing is removed from the cache, accuracy stays close to the original model while far fewer tokens are processed per step.

Core claim

The central claim is that the intractable problem of pruning visual tokens from dense documents can be solved by treating attention as a moving local fixation rather than a fixed global set, implemented through Focal-Guided Pruning that selects task-relevant tokens from focal layers at each step and Cross-Step Fixation Reuse that carries the prior fixation forward, all without any permanent token removal from the KV cache.
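The idea of selecting tokens without evicting them can be illustrated with a minimal sketch. It is not the paper's implementation: the scoring rule (raw query-key dot products), the helper names `select_top_k` and `attend`, and the toy cache are all assumptions made for illustration. The point it shows is that the cache stays intact while each step attends to only a chosen subset of positions.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_top_k(scores, k):
    """Return the indices of the k highest-scoring visual tokens, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def attend(query, keys, values, keep):
    """Scaled dot-product attention restricted to the cache positions in `keep`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in keep]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(dim)]

# Toy cache with four visual tokens (hypothetical values); nothing is deleted from it.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
query = [2.0, 0.0]

# Assumed relevance score for this step: the raw query-key dot product.
scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
keep = select_top_k(scores, k=2)
output = attend(query, keys, values, keep)
print(keep, output, len(keys))
```

Because `keys` and `values` are never modified, a later step with a different query can select a different subset, which is the property that separates this from permanent eviction.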

What carries the argument

Dynamic Visual Fixation, the observed pattern in which model attention concentrates on a small shifting region of the document image across successive decoding steps instead of attending uniformly.

If this is right

The same plug-and-play modules can be added to any of the five tested VLMs of different sizes and architectures without retraining.
Attention latency drops by a factor of three while accuracy remains at 98 percent of the unpruned baseline on Qwen2.5-VL.
Because no tokens are evicted from the cache, the approach sidesteps the irreversible information loss that defeats physical pruning on text-dense images.
The gradual shift in fixation lets each decoding step warm-start its token selection from the previous step's fixation rather than recomputing relevance from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gradual-fixation pattern might appear in other dense visual tasks such as table extraction or chart reading, suggesting the method could transfer without modification.
If the focal layers turn out to be consistent across models, future implementations could pre-identify them once and reuse the choice for faster deployment.
Combining this cache-side selection with existing token-compression techniques applied before the cache might produce additive speedups on very long documents.
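One way the pre-identification of focal layers might look is sketched below. The criterion is an assumption: the Figure 3 caption mentions a mean image attention ratio, so the sketch ranks layers by the share of attention mass that falls on image tokens. The function names, the toy attention rows, and the choice of two focal layers are hypothetical.

```python
def image_attention_ratio(attn_row, image_positions):
    """Share of one query's attention mass that lands on image tokens."""
    total = sum(attn_row)
    on_image = sum(attn_row[i] for i in image_positions)
    return on_image / total if total > 0 else 0.0

def pick_focal_layers(attn_by_layer, image_positions, num_focal):
    """Rank layers by image attention ratio and keep the top `num_focal` (assumed rule)."""
    ratios = [image_attention_ratio(row, image_positions) for row in attn_by_layer]
    ranked = sorted(range(len(ratios)), key=lambda layer: ratios[layer], reverse=True)
    return sorted(ranked[:num_focal]), ratios

# Toy attention rows, one per layer, over 3 image tokens followed by 2 text tokens.
attn_by_layer = [
    [0.10, 0.10, 0.10, 0.40, 0.30],
    [0.30, 0.30, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.50, 0.30],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]
focal, ratios = pick_focal_layers(attn_by_layer, image_positions=[0, 1, 2], num_focal=2)
print(focal, [round(r, 2) for r in ratios])
```

If the chosen layers were stable across documents, the list returned here could be computed once offline and reused at inference time.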

Load-bearing premise

The model's attention on document images is temporally sparse and shifts gradually across decoding steps in a way that allows safe dynamic selection of tokens without irreversible loss of character or layout information.
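The premise can be probed with a simple overlap measurement, sketched here under stated assumptions. The Jaccard index between the top-k sets of consecutive decoding steps is one possible proxy for "gradual shift"; it is not a metric taken from the paper, and the score matrix below is synthetic.

```python
def top_k_set(scores, k):
    """Indices of the k highest-scoring tokens, as a set."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def consecutive_overlap(scores_per_step, k):
    """Jaccard overlap between the top-k token sets of each pair of adjacent steps."""
    sets = [top_k_set(scores, k) for scores in scores_per_step]
    return [jaccard(prev, cur) for prev, cur in zip(sets, sets[1:])]

# Synthetic attention scores: the peak slides one token to the right per step.
scores_per_step = [
    [0.9, 0.8, 0.7, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.9, 0.7, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.9, 0.7, 0.0],
]
overlaps = consecutive_overlap(scores_per_step, k=3)
print(overlaps)
```

High overlap between adjacent steps would support the premise; overlap near zero, for example at a column break, would mark the cases where warm-starting from the previous step is least reliable.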

What would settle it

Run the same OCR benchmarks with the dynamic selection replaced by random choice of the same fraction of tokens at each step; if accuracy collapses well below the reported retention level, the gradual-fixation premise is required for the method to work.
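A toy version of this control is sketched below. It does not measure OCR accuracy; it uses retained attention mass as a stand-in, which is an assumption, and the attention weights are drawn from a seeded random generator rather than from any model. The helper names are hypothetical.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(weights, k):
    """Indices of the k largest attention weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])

def random_k(num_tokens, k, rng):
    """Control condition: k token indices chosen uniformly at random."""
    return sorted(rng.sample(range(num_tokens), k))

def retained_mass(weights, keep):
    """Fraction of the full attention mass covered by the kept tokens."""
    return sum(weights[i] for i in keep)

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
logits = [rng.gauss(0.0, 2.0) for _ in range(200)]
weights = softmax(logits)
k = 10                                      # 5% of 200 synthetic visual tokens
focused = retained_mass(weights, top_k(weights, k))
control = retained_mass(weights, random_k(len(weights), k, rng))
print(f"top-k mass: {focused:.3f}, random mass: {control:.3f}")
```

The top-k subset can never retain less mass than a random subset of the same size, so the informative quantity is the size of the gap. A real test would replace retained mass with benchmark accuracy.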

Figures

Figures reproduced from arXiv: 2605.17447 by Ao Wang, Ben Wan, Guiguang Ding, Hui Chen, Ke Zhang, Leqi Shen, Sicheng Zhao, Tongxuan Liu, Yan Feng, Zihan Tang.

**Figure 2.** Overview of the FastOCR framework. Focal-Guided Pruning (FGP) consists of two sub…

**Figure 3.** Image attention distribution across layers. (a) Mean image attention ratio for focal vs. …

**Figure 4.** Sensitivity of the OmniDocBench Overall score to FastOCR’s four hyperparameters on …

**Figure 5.** Visualization of dynamic visual fixation across four consecutive decoding steps (…

**Figure 6.** Distribution of focal layers across all 1355 samples in OmniDocBench on Qwen2.5-VL …

Original abstract

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0$\times$.
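To relate the 5% keep ratio to an attention speedup, a rough upper bound can be worked out under a strong simplifying assumption: attention cost per step is proportional to the number of attended tokens, with no overhead for selection. The fraction of the context made up of visual tokens is not given in the abstract, so the values below are placeholders, and the function name is hypothetical.

```python
def ideal_attention_speedup(visual_fraction, keep_ratio):
    """Upper bound on attention speedup when only `keep_ratio` of visual tokens are attended.

    Assumes cost is linear in attended tokens and ignores selection overhead.
    """
    remaining = (1.0 - visual_fraction) + visual_fraction * keep_ratio
    return 1.0 / remaining

# Placeholder visual-token shares of the context; none of these come from the paper.
for visual_fraction in (0.5, 0.8, 0.95):
    bound = ideal_attention_speedup(visual_fraction, keep_ratio=0.05)
    print(f"visual fraction {visual_fraction:.2f}: ideal speedup {bound:.2f}x")
```

The bound depends heavily on how much of the context is visual, and it leaves out the cost of scoring and selecting tokens, so it should not be read as a derivation of the reported 3.0× figure.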

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FastOCR gives a practical training-free way to prune visual tokens dynamically during VLM decoding for OCR without permanent eviction, but the gradual-shift assumption looks shaky for complex layouts.

The paper's core move is to treat OCR attention as temporally sparse and gradually shifting, then use that to do local dynamic pruning instead of global eviction. They keep every token in the KV cache but only attend to a small focal set at each step, with one module picking from focal layers and the other reusing selections from the prior step. This is presented as a plug-and-play addition that works across five VLMs of different sizes.

Referee Report

3 major / 3 minor

Summary. The paper introduces FastOCR, a training-free plug-and-play framework for accelerating document parsing in VLMs. It observes that attention over visual tokens in dense document images is temporally sparse and shifts gradually across decoding steps (analogous to human fixation). The method recasts pruning as a local dynamic problem via two modules: Focal-Guided Pruning, which selects task-relevant tokens from a small set of focal layers at each step, and Cross-Step Fixation Reuse, which warm-starts the current step from the prior fixation. By adjusting attention rather than permanently evicting tokens from the KV cache, the approach avoids irreversible information loss. Experiments claim that on Qwen2.5-VL the method retains 98% of unpruned accuracy while attending to only 5% of visual tokens per decoding step, yielding a 3.0× reduction in attention latency, and generalizes consistently across five VLMs of varying sizes and architectures.

Significance. If the dynamic visual fixation assumption holds across document types, FastOCR would provide a practical, low-overhead acceleration technique for high-token-count OCR tasks that sidesteps the accuracy collapse typical of permanent-eviction pruning methods. The training-free and cache-preserving design is a clear engineering strength, and the reported 3× latency gain at near-full accuracy would be impactful for deployment of VLMs on document understanding workloads.

major comments (3)

Abstract and §4 (Experiments): The central quantitative claim (98% accuracy retention at 5% tokens, 3.0× latency reduction on Qwen2.5-VL) is presented without any baseline comparisons, variance estimates, or statistical significance tests. This information is load-bearing for judging whether the reported gains are robust or merely within noise.
§3.1 (Focal-Guided Pruning): The procedure for identifying the small set of focal layers and for choosing the per-step pruning threshold is described at a high level but lacks an ablation or sensitivity analysis. Because the entire speedup rests on these choices, the manuscript must demonstrate that performance is stable under reasonable variation of these hyperparameters.
§3 and §5 (Dynamic Visual Fixation assumption): The method's safety claim (that gradual fixation shift permits safe dynamic selection without irreversible loss of character or layout information) depends on attention being temporally sparse and locally shifting. No experiments on multi-column pages, tables with cross-references, or figures are reported; such cases could violate the gradual-shift premise and are therefore load-bearing for the central claim.

minor comments (3)

§3.2: Clarify how Cross-Step Fixation Reuse interacts with the KV cache when the fixation region moves; a small diagram or pseudocode would remove ambiguity.
Results table (Table 1 or equivalent): Report both mean and standard deviation over multiple document samples rather than single-point estimates.
Conclusion: Add a short paragraph in the conclusion or limitations section explicitly listing document layouts on which the method has not yet been tested.
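The interaction raised in the §3.2 comment can be made concrete with a hedged sketch. It assumes that the warm start works by re-scoring only the previous fixation plus a small neighborhood around it; the paper may do this differently. The one-dimensional token order, the `radius` parameter, and the scoring function are illustrative simplifications, and the cache itself is never modified.

```python
def expand(prev_keep, radius, num_tokens):
    """Candidate set: each previously fixated index plus neighbours within `radius`."""
    candidates = set()
    for i in prev_keep:
        low, high = max(0, i - radius), min(num_tokens - 1, i + radius)
        candidates.update(range(low, high + 1))
    return sorted(candidates)

def reuse_step(score_fn, prev_keep, k, radius, num_tokens):
    """Score only the warm-start candidates and keep the best k of them."""
    candidates = expand(prev_keep, radius, num_tokens)
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return sorted(ranked[:k]), len(candidates)

# Hypothetical step: the fixation centre moves from token 5 to token 6.
num_tokens = 20
prev_keep = [4, 5, 6]
centre = 6
keep, scored = reuse_step(lambda i: -abs(i - centre), prev_keep, k=3, radius=2,
                          num_tokens=num_tokens)
print(keep, scored, num_tokens)
```

In this toy step only 7 of the 20 cached tokens are scored, yet the selection follows the fixation to its new centre. A jump larger than `radius` would be missed, which is the failure mode the referee's question about moving fixation regions points at.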

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below and will revise the manuscript to incorporate additional analyses and experiments where needed to strengthen the claims.

Referee: Abstract and §4 (Experiments): The central quantitative claim (98% accuracy retention at 5% tokens, 3.0× latency reduction on Qwen2.5-VL) is presented without any baseline comparisons, variance estimates, or statistical significance tests. This information is load-bearing for judging whether the reported gains are robust or merely within noise.

Authors: We agree that baseline comparisons, variance estimates, and statistical significance tests would strengthen the presentation of our results. In the revised manuscript we will add comparisons to representative static and dynamic pruning baselines from the VLM literature, report mean accuracy and latency with standard deviations over at least five independent evaluation runs on the test sets, and include paired statistical tests to establish that the observed differences are significant. Revision: yes.

Referee: §3.1 (Focal-Guided Pruning): The procedure for identifying the small set of focal layers and for choosing the per-step pruning threshold is described at a high level but lacks an ablation or sensitivity analysis. Because the entire speedup rests on these choices, the manuscript must demonstrate that performance is stable under reasonable variation of these hyperparameters.

Authors: We acknowledge the importance of demonstrating robustness to these design choices. The revised version will include a dedicated sensitivity study in §3.1 that varies the number of focal layers (1–8) and the pruning ratio (top-1% to top-10%), showing that accuracy retention stays above 95% and latency gains remain consistent across the tested range. Revision: yes.

Referee: §3 and §5 (Dynamic Visual Fixation assumption): The method's safety claim (that gradual fixation shift permits safe dynamic selection without irreversible loss of character or layout information) depends on attention being temporally sparse and locally shifting. No experiments on multi-column pages, tables with cross-references, or figures are reported; such cases could violate the gradual-shift premise and are therefore load-bearing for the central claim.

Authors: We recognize that explicit validation on complex layouts is necessary to support the core assumption. While our current benchmarks contain diverse documents, we did not isolate multi-column pages, cross-referenced tables, or figures. In the revision we will add a targeted evaluation subsection using suitable examples from these categories and will report accuracy, token usage, and any observed deviations from the gradual-shift behavior, together with a discussion of limitations in §5. Revision: yes.
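The reporting promised in the first response (means, standard deviations, and a paired test over repeated runs) could be computed along the following lines. This is a sketch using only the Python standard library; the five score pairs are made-up placeholders, not results from the paper, and `paired_t` is a hypothetical helper that returns the paired t-statistic without a p-value.

```python
import math
import statistics

def paired_t(baseline, treated):
    """Mean difference, its sample standard deviation, and the paired t-statistic."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat

# Placeholder scores for five runs (NOT from the paper): unpruned vs. pruned model.
baseline = [0.90, 0.88, 0.92, 0.91, 0.89]
treated = [0.89, 0.86, 0.91, 0.89, 0.88]

mean_diff, sd_diff, t_stat = paired_t(baseline, treated)
print(f"baseline: {statistics.mean(baseline):.3f} ± {statistics.stdev(baseline):.3f}")
print(f"treated:  {statistics.mean(treated):.3f} ± {statistics.stdev(treated):.3f}")
print(f"mean diff {mean_diff:.3f}, sd {sd_diff:.4f}, t = {t_stat:.2f} (df = {len(baseline) - 1})")
```

Pairing the runs matters because both conditions are evaluated on the same documents, so the test looks at per-run differences instead of treating the two sets of scores as independent samples.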

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

The paper introduces FastOCR as a training-free plug-and-play module based on an empirical observation of temporally sparse attention in VLMs processing documents. The reported performance numbers (98% accuracy retention at 5% tokens, 3.0× latency reduction on Qwen2.5-VL) are obtained from direct experiments across five models rather than from any closed-form derivation, fitted parameters, or self-citation chain that reduces the claims to inputs by construction. No equations, uniqueness theorems, or ansatzes are presented that equate the pruning strategy or accuracy metrics to quantities defined within the paper itself; the method's correctness is externally falsifiable via standard OCR benchmarks and remains self-contained against those benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on an empirical observation of temporally sparse attention rather than new mathematical axioms or invented physical entities; no free parameters are explicitly named in the abstract.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV · 2026-05 · unverdicted · novelty 6.0

RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV · 2026-05 · unverdicted · novelty 5.0

RTPrune introduces a reading-twice inspired two-stage pruning technique for DeepSeek-OCR that retains 84.25% tokens while delivering 99.47% accuracy and 1.23x faster prefill on OmniDocBench.

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV · 2026-05 · unverdicted · novelty 4.0

RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowsk...

work page 2022
[2]

Nikolopoulos, Hans Vandierendonck, Deepu John, and Bo Ji

Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, and Bo Ji. HiRED: Attention-guided token dropping for efficient inference of high-resolution vision-language models. InProceedings of the AAAI Conference on Artificial Intelligence, 2025

work page 2025
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report.ar...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Nougat: Neural optical under- standing for academic documents

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical under- standing for academic documents. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[5]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision (ECCV), 2024

work page 2024
[7]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.arXiv preprint arXiv:2312.14238, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[9]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory- efficient exact attention with IO-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[10]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Donut: Document understanding transformer without OCR

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Donut: Document understanding transformer without OCR. arXiv preprint arXiv:2111.15664, 2021

work page arXiv 2021
[12]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2023

work page 2023
[13]

Pix2Struct: Screenshot parsing as pretraining for visual language understanding

Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. InInternational Conference on Machine Learning (ICML), 2023

work page 2023
[14]

LLaV A-OneVision: Easy visual task transfer.Transactions on Machine Learning Research (TMLR), 2025

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaV A-OneVision: Easy visual task transfer.Transactions on Machine Learning Research (TMLR), 2025. 10

work page 2025
[15]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational Conference on Machine Learning (ICML), 2023

work page 2023
[16]

SnapKV: LLM knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[17]

InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5

Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, and Colin Zhang. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025. URLhttps://arxiv.org/abs/2512.02498

work page arXiv 2025
[18]

Boosting multimodal large language models with visual tokens withdrawal for rapid inference

Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. InProceedings of the AAAI Conference on Artificial Intelligence, 2025

work page 2025
[19]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[20]

Multi-stage vision token dropping: Towards efficient multimodal large language model

Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, and Linfeng Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model.arXiv preprint arXiv:2411.10803, 2024

work page arXiv 2024
[21]

xllm technical report, 2026

Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, ...

work page arXiv 2026
[22]

Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[23]

KIVI: A tuning-free asymmetric 2bit quantization for KV cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. InICML, Proceedings of Machine Learning Research, pages 32332–32344. PMLR / OpenReview.net, 2024

work page 2024
[24]

Joty, and Enamul Hoque

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq R. Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the Association for Computational Linguistics (ACL), 2022

work page 2022
[25]

Minesh Mathew, Dimosthenis Karatzas, and C. V . Jawahar. DocVQA: A dataset for VQA on document images. InIEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021

work page 2021
[26]

OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, and Conghui He. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. InIEEE/CVF Conference on Computer...

work page 2025
[27]

olmocr: Unlocking trillions of tokens in pdfs with vi- sion language models.arXiv preprint arXiv:2502.18443, 2025a

Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models, 2025. URLhttps://arxiv.org/abs/2502.18443

work page arXiv 2025
[28]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), 2021

work page 2021
[29]

Eye movements in reading and information processing: 20 years of research.Psychological Bulletin, 124(3):372–422, 1998

Keith Rayner. Eye movements in reading and information processing: 20 years of research.Psychological Bulletin, 124(3):372–422, 1998

work page 1998
[30]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaV A-PruMerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388, 2024. 11

work page arXiv 2024
[31]

Fastvid: Dynamic density pruning for fast video large language models

Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. Fastvid: Dynamic density pruning for fast video large language models.CoRR, abs/2503.11187, 2025

work page arXiv 2025
[32]

Tempme: Video temporal token merging for efficient text-video retrieval

Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, and Guiguang Ding. Tempme: Video temporal token merging for efficient text-video retrieval. InICLR. OpenReview.net, 2025

work page 2025
[33]

Quest: Query-aware sparsity for efficient long-context LLM inference

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context LLM inference. InInternational Conference on Machine Learning (ICML), 2024

work page 2024
[34]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017
[35]

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng, Jia Wang, and Tongxuan Liu. Rt- prune: Reading-twice inspired token pruning for efficient deepseek-ocr inference.arXiv preprint arXiv:2605.00392, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

MinerU: An Open-Source Solution for Precise Document Content Extraction

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, and Conghui He. MinerU: An open-source solution for precise document content extraction.arXiv preprint arXiv:2409.18839, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, and Xiangyu Zhang. General OCR theory: Towards OCR-2.0 via a unified end-to-end model.arXiv preprint arXiv:2409.01704, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, Yukun Li, et al. DeepSeek-OCR: Contexts optical compression.arXiv preprint arXiv:2510.18234, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Ooco: Latency-disaggregated architecture for online-offline co-locate llm serving, 2025

Siyu Wu, Zihan Tang, Yuting Zeng, Hui Chen, Guiguang Ding, Tongxuan Liu, Ke Zhang, and Hailong Yang. Ooco: Latency-disaggregated architecture for online-offline co-locate llm serving, 2025. URL https://arxiv.org/abs/2511.21862

work page arXiv 2025
[40]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[41]

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, and Dahua Lin. PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction.arXiv preprint arXiv:2410.17247, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

VisionZip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is better but not necessary in vision language models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[43]

Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and Shanghang Zhang

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A. Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and Shanghang Zhang. SparseVLM: Visual token sparsifi- cation for efficient vision-language model inference. InInternational Conference on Machine Learning (ICML), 2025

work page 2025
[44]

Barrett, Zhangyang Wang, and Beidi Chen

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark W. Barrett, Zhangyang Wang, and Beidi Chen. H2O: heavy-hitter oracle for efficient generative inference of large language models. InNeurIPS, 2023

work page 2023
[45]

dominant tokens

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. InInternational Conference on Learning Representations (ICLR), 2024. 12 A Algorithm Algorithm 1 summarizes the full procedure of FastOCR at a single decoding step t. The method maintains two pieces of...


