Unlimited OCR Works
Pith reviewed 2026-06-26 09:01 UTC · model grok-4.3
The pith
Replacing every decoder attention layer with R-SWA keeps the KV cache at a constant size, so OCR can transcribe dozens of pages in one pass under a 32K maximum length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replacing all attention layers in the decoder with Reference Sliding Window Attention, Unlimited OCR maintains a constant KV cache size throughout decoding, allowing transcription of dozens of pages in a single forward pass under a 32K maximum length.
What carries the argument
Reference Sliding Window Attention (R-SWA), an attention mechanism that reduces computation costs while enforcing constant KV cache size for the entire decoding process.
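The summary above does not describe R-SWA's internals, so the following is only a toy sketch of how a cache can stay bounded during decoding. It assumes, based on the name alone, a fixed block of reference entries that is never evicted plus a rolling window of recent entries. The class `BoundedKVCache` and its parameters `reference_entries` and `window_size` are hypothetical and are not taken from the paper or its code release.

```python
from collections import deque


class BoundedKVCache:
    """Toy cache: a fixed reference block plus a rolling window of recent entries.

    Illustrative assumption only; this is not the paper's R-SWA implementation.
    """

    def __init__(self, reference_entries, window_size):
        # Reference entries are stored once and never evicted (assumed behaviour).
        self.reference = list(reference_entries)
        # A deque with maxlen drops its oldest entry automatically once full.
        self.window = deque(maxlen=window_size)

    def append(self, entry):
        """Record the key/value entry produced by one decoding step."""
        self.window.append(entry)

    def visible(self):
        """Entries the next decoding step would be allowed to attend to."""
        return self.reference + list(self.window)

    def __len__(self):
        return len(self.reference) + len(self.window)


cache = BoundedKVCache(reference_entries=["ref0", "ref1", "ref2"], window_size=4)
sizes = []
for step in range(10):
    cache.append(f"tok{step}")
    sizes.append(len(cache))
```

In this sketch the cache size stops growing once the window is full, at `len(reference) + window_size` entries, no matter how many further steps are decoded.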
If this is right
- OCR models can now handle multi-page documents without splitting or repeated passes.
- The constant cache removes the progressive slowdown that normally appears as output length grows.
- R-SWA can be swapped into other sequence-to-sequence tasks that require long output sequences.
- The design emulates human working memory by avoiding ever-growing state during copying tasks.
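The memory argument behind these points can be made concrete with the standard KV cache size formula: two tensors (keys and values) per layer, each holding one vector per cached position per KV head. The sketch below compares a cache that grows with output length against one capped at a window. The model shape, the prefix length, and the window size are made-up numbers chosen for illustration, not values from the paper.

```python
def kv_cache_bytes(cached_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to store keys and values for `cached_tokens` positions."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * cached_tokens * bytes_per_value


def cached_tokens_full(prefix_len, generated):
    """Full attention keeps every prefix and generated position."""
    return prefix_len + generated


def cached_tokens_windowed(prefix_len, generated, window):
    """A windowed cache keeps the prefix plus at most `window` recent positions."""
    return prefix_len + min(generated, window)


# Hypothetical model shape, for illustration only.
SHAPE = dict(num_layers=12, num_kv_heads=8, head_dim=64)

full_sizes = []
windowed_sizes = []
for generated in (1_000, 8_000, 32_000):
    full_sizes.append(kv_cache_bytes(cached_tokens_full(256, generated), **SHAPE))
    windowed_sizes.append(
        kv_cache_bytes(cached_tokens_windowed(256, generated, 512), **SHAPE)
    )
```

Under these assumptions `full_sizes` increases with every additional generated token, while `windowed_sizes` is identical for all three output lengths because each exceeds the window.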
Where Pith is reading between the lines
- Tasks like automatic speech recognition or machine translation could adopt the same attention change to handle long inputs or outputs efficiently.
- If R-SWA truly preserves accuracy, it offers a drop-in replacement for standard attention in any decoder that copies or transcribes text.
- Future work could test whether the constant cache also reduces peak memory enough to run larger models on the same hardware.
Load-bearing premise
That swapping in R-SWA keeps the original OCR accuracy and language-modeling benefits intact even though no accuracy numbers or baseline comparisons are shown.
What would settle it
Run Unlimited OCR and the baseline DeepSeek OCR on the same set of multi-page documents and measure character error rate; if error rate rises sharply with R-SWA, the claim that accuracy is preserved fails.
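A minimal sketch of the metric this test relies on: character error rate (CER) is commonly computed as the Levenshtein edit distance between the reference transcript and the model output, divided by the reference length. The function names are illustrative, and real evaluations usually add text normalization steps that are omitted here.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    # previous[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Running both models on the same pages and comparing `character_error_rate` per document would show whether the error rate changes when the attention layers are swapped.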
Original abstract
Recently, end-to-end OCR models, exemplified by DeepSeek OCR, have once again thrust OCR into the spotlight. A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation. This stands in stark contrast to humans, who exhibit no such decline in efficiency during long-horizon copying tasks. In this technical report, we propose Unlimited OCR, a model designed to emulate human parsing working memory. Taking DeepSeek OCR as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces attention computation costs while maintaining a constant KV cache throughout the entire decoding process. By combining the high compression rate of DeepSeek OCR's encoder with our constant KV cache design, Unlimited OCR can transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism - beyond OCR, it is equally applicable to tasks such as ASR, translation, etc. Codes and model weights are publicly available at http://github.com/baidu/Unlimited-OCR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Unlimited OCR, an extension of DeepSeek OCR that replaces all decoder attention layers with Reference Sliding Window Attention (R-SWA). This is claimed to enforce a constant KV cache size while preserving the LLM decoder's language prior, enabling transcription of dozens of pages in one forward pass under a 32K context limit. R-SWA is further positioned as a general-purpose mechanism applicable beyond OCR to tasks such as ASR and translation.
Significance. If the accuracy-preservation claim holds with supporting measurements, the work would address a practical bottleneck in long-context LLM-based OCR and parsing models by decoupling memory usage from output length. The public release of code and weights would further strengthen its potential impact as a reusable attention variant.
Major comments (2)
- [Abstract] The central claim that R-SWA 'maintains' the language prior benefits and overall OCR accuracy of the DeepSeek OCR baseline is asserted without any CER/WER numbers, ablation tables, or direct comparisons, leaving the accuracy-preservation step as an unevaluated assumption rather than a demonstrated result.
- [Abstract] No quantitative results, ablation studies, or error analysis are supplied to support the performance claims (dozens of pages in a single 32K pass) or the generality claim for ASR/translation, which are load-bearing for the paper's contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the need for empirical support. We respond point-by-point below.
- Referee: [Abstract] The central claim that R-SWA 'maintains' the language prior benefits and overall OCR accuracy of the DeepSeek OCR baseline is asserted without any CER/WER numbers, ablation tables, or direct comparisons, leaving the accuracy-preservation step as an unevaluated assumption rather than a demonstrated result.
  Authors: We agree the comment is correct: the abstract asserts preservation of the language prior without supporting measurements or comparisons. The technical report centers on the R-SWA design for constant KV cache while retaining the original decoder structure, but this does not constitute a demonstration. We will revise the abstract to describe R-SWA as intended to preserve the prior rather than claiming it maintains accuracy.
  Revision: yes
- Referee: [Abstract] No quantitative results, ablation studies, or error analysis are supplied to support the performance claims (dozens of pages in a single 32K pass) or the generality claim for ASR/translation, which are load-bearing for the paper's contribution.
  Authors: We agree the comment is correct: the abstract states the ability to transcribe dozens of pages and positions R-SWA as general-purpose without quantitative results, ablations, or error analysis. The constant-KV-cache property is a direct consequence of the sliding-window design, but the specific performance numbers and cross-task applicability are not demonstrated. We will revise the abstract and add a limitations paragraph to qualify these statements as design implications rather than evaluated outcomes.
  Revision: yes
Not resolved by the rebuttal: supplying CER/WER numbers, ablation tables, error analysis, or results on ASR/translation tasks, because the current manuscript is a method-focused technical report without experimental evaluations.
Circularity Check
No derivation chain present; architectural proposal only
The manuscript describes an engineering modification: replace decoder attention layers in an external baseline (DeepSeek OCR) with a new mechanism called R-SWA to enforce constant KV cache size. No equations, no fitted parameters, no derived predictions, and no self-citations appear in the provided text. The central claim is a design assertion rather than a mathematical reduction, so no load-bearing step reduces to its own inputs by construction.