Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production
Pith reviewed 2026-05-20 21:48 UTC · model grok-4.3
The pith
A microservice architecture runs OCR and LLM document pipelines at production scale, processing thousands of multi-page documents per hour.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A microservice architecture supports reliable production operation at the scale of thousands of multi-page documents per hour. The architecture encapsulates pipelines of classification, optical character recognition (OCR), and large language model (LLM) structured field extraction, and it rests on four design decisions: hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, asynchronous processing for IO-bound operations, and independent horizontal scaling. Profiling reveals that OCR dominates end-to-end latency and that concurrency is limited by shared GPU-inference capacity rather than by worker count.
What carries the argument
Microservice architecture that orchestrates hybrid classification, GPU inference for OCR and LLMs, asynchronous CPU tasks, and independent horizontal scaling of components.
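The separation of CPU-side orchestration from a shared GPU-inference service can be sketched as a small simulation. This is not the paper's implementation: the names `GPU_SLOTS`, `GPU_SECONDS`, `IO_SECONDS`, `process_document`, `worker`, `run`, and `measure` are hypothetical, the sleep calls are placeholders for real network and inference work, and an `asyncio.Semaphore` stands in for whatever capacity limit the shared GPU service actually enforces.

```python
import asyncio
import time

GPU_SLOTS = 2        # assumption: requests the shared GPU service runs at once
GPU_SECONDS = 0.05   # assumption: simulated inference time per document
IO_SECONDS = 0.01    # assumption: simulated fetch/store time per document


async def process_document(gpu: asyncio.Semaphore) -> None:
    """One document: IO-bound fetch, GPU-bound inference, IO-bound store."""
    await asyncio.sleep(IO_SECONDS)       # stand-in for downloading pages
    async with gpu:                       # wait for a free slot on the shared GPU
        await asyncio.sleep(GPU_SECONDS)  # stand-in for OCR or LLM inference
    await asyncio.sleep(IO_SECONDS)       # stand-in for storing extracted fields


async def worker(queue: asyncio.Queue, gpu: asyncio.Semaphore) -> None:
    """Orchestration worker: pulls documents until the queue is empty."""
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        await process_document(gpu)


async def run(n_docs: int, n_workers: int) -> float:
    """Return simulated throughput in documents per second."""
    queue: asyncio.Queue = asyncio.Queue()
    for doc_id in range(n_docs):
        queue.put_nowait(doc_id)
    gpu = asyncio.Semaphore(GPU_SLOTS)
    start = time.perf_counter()
    await asyncio.gather(*(worker(queue, gpu) for _ in range(n_workers)))
    return n_docs / (time.perf_counter() - start)


def measure(n_docs: int = 20) -> dict[int, float]:
    """Run the simulation for several worker counts."""
    return {w: asyncio.run(run(n_docs, w)) for w in (1, 2, 4, 8, 16)}


if __name__ == "__main__":
    for workers, rate in measure().items():
        print(f"{workers:>2} workers: {rate:6.1f} docs/s")
```

Under these placeholder timings, throughput should rise with the first few workers and then flatten once both simulated GPU slots stay busy, which mirrors the qualitative behavior the paper reports. The numbers themselves carry no meaning beyond the simulation.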
If this is right
- Teams should profile pipelines to confirm that OCR accounts for most latency and direct optimization effort there rather than to the language-model stage.
- Resource allocation should prioritize GPU capacity over additional CPU workers because concurrency saturates at the shared inference hardware.
- Separating GPU inference services from CPU orchestration services allows each to scale at its own rate without blocking the other.
- Hybrid classification can route documents to the most suitable downstream models while keeping the overall pipeline responsive.
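The first implication, profiling before optimizing, could be checked with a per-stage timer along these lines. This is a minimal sketch, assuming a pipeline with three named stages; `StageProfiler` and its methods are hypothetical names, and the sleep durations in the demo are placeholders rather than measurements from the paper.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        """Time the enclosed block and add it to the stage's running total."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> dict[str, float]:
        """Fraction of total recorded time per stage (empty if nothing ran)."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: seconds / total for name, seconds in self.totals.items()}

    def dominant(self) -> str:
        """Stage with the largest total; raises ValueError if nothing ran."""
        return max(self.totals, key=self.totals.get)


if __name__ == "__main__":
    profiler = StageProfiler()
    # Placeholder workloads; replace with real classify / OCR / extract calls.
    with profiler.stage("classify"):
        time.sleep(0.002)
    with profiler.stage("ocr"):
        time.sleep(0.03)
    with profiler.stage("extract"):
        time.sleep(0.005)
    for name, share in profiler.shares().items():
        print(f"{name:>8}: {share:.0%}")
    print("dominant stage:", profiler.dominant())
```

Wrapping each stage call in `profiler.stage(...)` across a batch gives the latency split that decides where optimization effort should go. Whether OCR comes out on top for a given workload is exactly what the measurement is meant to establish.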
Where Pith is reading between the lines
- The same separation of inference and orchestration services could be tested on other high-volume AI workloads that mix vision and language models.
- Measuring cost per document under variable cloud pricing would show whether the asynchronous and independent-scaling choices reduce expenses in practice.
- Repeating the batch profiling after swapping in newer OCR models would test whether the dominance of OCR over LLM steps persists.
Load-bearing premise
The described design decisions around task separation, asynchronous processing, and independent scaling will continue to produce the same latency and concurrency behavior for document types and hardware setups different from the ones tested.
What would settle it
The reported findings would be challenged if the same pipeline, deployed on a new collection of documents or on different hardware, showed end-to-end latency that is no longer dominated by OCR, or showed throughput that keeps increasing beyond the GPU limit as more workers are added.
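Such a replication could reduce its measurements to two checks, sketched below under stated assumptions. The function names, the 5% plateau threshold, and the reading of "dominates" as "largest single stage" are all choices made for this illustration, not definitions taken from the paper, and the figures in the usage lines are invented.

```python
from typing import Optional


def saturation_point(throughput_by_workers: dict[int, float],
                     min_gain: float = 0.05) -> Optional[int]:
    """Return the worker count after which adding workers stops helping.

    Walks worker counts in ascending order and returns the first count whose
    next step improves throughput by less than min_gain (relative). Returns
    None if throughput is still scaling at every measured step.
    Assumes all throughput values are positive.
    """
    counts = sorted(throughput_by_workers)
    for prev, nxt in zip(counts, counts[1:]):
        gain = throughput_by_workers[nxt] / throughput_by_workers[prev] - 1.0
        if gain < min_gain:
            return prev
    return None


def ocr_dominates(stage_seconds: dict[str, float]) -> bool:
    """True if the stage labeled "ocr" has the largest total latency."""
    return max(stage_seconds, key=stage_seconds.get) == "ocr"


if __name__ == "__main__":
    # Invented measurements, for illustration only.
    measured = {1: 100.0, 2: 200.0, 4: 390.0, 8: 400.0, 16: 400.0}
    print("saturates at workers =", saturation_point(measured))
    print("OCR dominates:", ocr_dominates(
        {"classify": 1.0, "ocr": 6.0, "extract": 3.0}))
```

A result of `None` from `saturation_point`, or `False` from `ocr_dominates`, on a new deployment would be the kind of evidence that counts against the paper's two findings.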
Original abstract
Academic research tends to focus on new models for document understanding, creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction as well as our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a microservice architecture for production document AI pipelines that combine hybrid document classification, OCR, and LLM-based structured field extraction. It details design choices such as separating GPU-bound inference from CPU-bound orchestration, asynchronous handling of IO-bound steps, and independent horizontal scaling. The authors report operational experience processing thousands of multi-page documents per hour and describe two qualitative findings from batch profiling: OCR dominates end-to-end latency, and saturation is governed by shared GPU capacity rather than worker count.
Significance. If the reported observations hold, the work supplies concrete, practitioner-oriented architectural patterns that address the well-known gap between model-centric research and scalable deployment of document understanding systems. The emphasis on real throughput, GPU/CPU separation, and async processing constitutes a useful contribution for applied AI venues, particularly when accompanied by reproducible code or detailed metrics.
major comments (2)
- [Abstract and batch profiling section] The two central qualitative findings (OCR latency dominance and GPU-limited saturation) are stated without any accompanying quantitative values, such as measured per-document latencies, throughput in documents per hour, GPU utilization percentages, or comparisons against a baseline monolithic deployment. These omissions make it difficult to assess the magnitude or reproducibility of the observations that motivate the architectural recommendations.
- [Production experience section] The claim of successful operation at thousands of multi-page documents per hour is presented without error rates, extraction accuracy figures, or resource-consumption breakdowns that would demonstrate the load-bearing impact of the hybrid classification and scaling decisions.
minor comments (2)
- [Design decisions section] Add a simple architecture diagram or pseudocode snippet illustrating the hybrid classification step and the GPU/CPU separation to improve clarity for readers unfamiliar with the deployment.
- [Abstract] The abstract states the goal of providing 'concrete architectural patterns'; consider adding a short table summarizing the four primary design decisions and their observed effects.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation. We have revised the manuscript to incorporate quantitative details addressing the concerns about missing metrics while preserving the paper's focus on architectural patterns and operational insights.
Point-by-point responses
- Referee: [Abstract and batch profiling section] The two central qualitative findings (OCR latency dominance and GPU-limited saturation) are stated without any accompanying quantitative values, such as measured per-document latencies, throughput in documents per hour, GPU utilization percentages, or comparisons against a baseline monolithic deployment. These omissions make it difficult to assess the magnitude or reproducibility of the observations that motivate the architectural recommendations.
  Authors: We agree that quantitative values strengthen the presentation of our findings. In the revised version, we have added specific metrics from our batch profiling runs, including average per-document latencies for OCR versus LLM steps, observed throughput in documents per hour, GPU utilization percentages at saturation points, and a discussion of how these compare to expectations from a monolithic deployment. These additions are integrated into the batch profiling section and referenced in the abstract to better support the architectural recommendations.
  Revision: yes
- Referee: [Production experience section] The claim of successful operation at thousands of multi-page documents per hour is presented without error rates, extraction accuracy figures, or resource-consumption breakdowns that would demonstrate the load-bearing impact of the hybrid classification and scaling decisions.
  Authors: We acknowledge that additional metrics would better illustrate the impact of our design choices. The revised production experience section now includes aggregate extraction accuracy figures from our deployment, as well as resource-consumption breakdowns (CPU, GPU, and memory utilization) under sustained load. We have also clarified the role of hybrid classification and independent scaling in achieving the reported throughput. Detailed per-component error rates are not provided because they are highly dependent on document quality and specific model versions; we instead report overall system-level reliability metrics.
  Revision: partial
Circularity Check
No significant circularity in applied engineering report
Full rationale
The paper is an applied engineering report that describes a microservice architecture for document AI pipelines (classification, OCR, LLM extraction) and shares qualitative observations from production runs at thousands of multi-page documents per hour. It details concrete design decisions such as hybrid classification, GPU/CPU separation, asynchronous processing, and horizontal scaling, plus two profiled findings on latency and saturation. There are no mathematical derivations, equations, fitted parameters, predictions, or self-referential claims that reduce to prior assumptions or inputs within the paper. The central contribution consists of deployment-specific architectural patterns and empirical observations presented as tied to the authors' environment, with no load-bearing steps relying on self-citation chains or definitional equivalences. With no derivation chain that could loop back on its own inputs, the report shows no significant circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: microservice architectures with separation of GPU and CPU workloads are suitable for scaling document processing pipelines.