Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

Yao Fehlis; Benjamin Bengfort; Zhangzhang Si; Vahid Eyorokon; Prema Roman; Patrick Deziel; Devon Slonaker; Steve Veldman; Ben Johnson; Joyce Rigelo; Michael Wharton; Steve Kramer

arXiv: 2605.18818 · v1 · pith:XUSEJK3Vnew · submitted 2026-05-12 · cs.AI · cs.LG · cs.SE

Pith reviewed 2026-05-20 21:48 UTC · model grok-4.3

classification cs.AI · cs.LG · cs.SE

keywords document AI · microservice architecture · OCR · large language models · production pipelines · GPU inference · asynchronous processing · horizontal scaling

The pith

A microservice architecture runs OCR and LLM document pipelines at production scale, processing thousands of multi-page documents per hour.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a microservice design that chains together classification, optical character recognition, and large language model steps to extract structured fields from real documents. It explains concrete choices such as keeping GPU model calls separate from CPU orchestration, running input-output work asynchronously, and scaling each service on its own. The authors describe operating the system at thousands of multi-page documents per hour and report two observations from batch profiling: OCR steps take longer than the language-model extraction, and overall throughput is capped by shared GPU capacity rather than by the number of worker processes. The goal is to give engineers reusable patterns that move document AI past research benchmarks into reliable daily use.

Core claim

A microservice architecture that encapsulates pipelines of classification, optical character recognition, and large language model structured field extraction supports reliable production operation at the scale of thousands of multi-page documents per hour. It does so through four design decisions: hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, asynchronous processing for IO-bound operations, and independent horizontal scaling. Profiling reveals that OCR dominates end-to-end latency and that concurrency is limited by shared GPU-inference capacity rather than worker count.

What carries the argument

Microservice architecture that orchestrates hybrid classification, GPU inference for OCR and LLMs, asynchronous CPU tasks, and independent horizontal scaling of components.
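The separation of CPU-side orchestration from a shared GPU inference service can be illustrated with a minimal sketch. It is not the authors' implementation: `GpuInferenceService`, `run_batch`, the slot count, and the sleep that stands in for model latency are all assumptions made for illustration. The point it shows is that asynchronous workers can outnumber GPU slots, yet the number of inference calls in flight never exceeds the shared capacity.

```python
import asyncio


class GpuInferenceService:
    """Hypothetical stand-in for a shared GPU inference service with fixed capacity."""

    def __init__(self, slots: int) -> None:
        self._slots = asyncio.Semaphore(slots)  # models the shared GPU capacity
        self.in_flight = 0
        self.peak_in_flight = 0

    async def infer(self, page: str) -> str:
        async with self._slots:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            await asyncio.sleep(0.001)  # placeholder for OCR or LLM latency
            self.in_flight -= 1
            return f"text({page})"


async def worker(queue: asyncio.Queue, gpu: GpuInferenceService, results: list) -> None:
    """CPU-side orchestration: pull a page, await inference, store the result."""
    while True:
        page = await queue.get()
        try:
            results.append(await gpu.infer(page))
        finally:
            queue.task_done()


async def run_batch(pages: list, n_workers: int, gpu_slots: int):
    gpu = GpuInferenceService(gpu_slots)
    queue: asyncio.Queue = asyncio.Queue()
    for page in pages:
        queue.put_nowait(page)
    results: list = []
    workers = [asyncio.create_task(worker(queue, gpu, results)) for _ in range(n_workers)]
    await queue.join()  # wait until every page has been processed
    for task in workers:
        task.cancel()  # idle workers are blocked on queue.get(); stop them
    await asyncio.gather(*workers, return_exceptions=True)
    return results, gpu.peak_in_flight


if __name__ == "__main__":
    pages = [f"page-{i}" for i in range(20)]
    results, peak = asyncio.run(run_batch(pages, n_workers=8, gpu_slots=2))
    print(len(results), "pages processed; peak concurrent inference calls:", peak)
```

In this toy setup, raising `n_workers` past `gpu_slots` adds queued requests rather than parallel inference, which is the qualitative saturation behavior the summary describes.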

If this is right

Teams should profile pipelines to confirm that OCR accounts for most latency and direct optimization effort there rather than to the language-model stage.
Resource allocation should prioritize GPU capacity over additional CPU workers because concurrency saturates at the shared inference hardware.
Separating GPU inference services from CPU orchestration services allows each to scale at its own rate without blocking the other.
Hybrid classification can route documents to the most suitable downstream models while keeping the overall pipeline responsive.
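The routing idea in the last point can be sketched in a few lines. The source gives little detail on the hybrid strategy beyond the Figure 2 caption, which mentions CLIP-KNN. The sketch below therefore assumes one plausible reading: a nearest-neighbour vote over precomputed embeddings, with a fallback classifier when the vote is not confident. The names `knn_vote` and `classify`, the `min_agreement` threshold, and the toy two-dimensional vectors are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Index = List[Tuple[Vector, str]]  # (embedding, label) pairs; embeddings assumed precomputed


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_vote(query: Vector, index: Index, k: int) -> Tuple[str, float]:
    """Return the majority label among the k nearest neighbours and its vote share."""
    nearest = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    label, votes = Counter(lbl for _, lbl in nearest).most_common(1)[0]
    return label, votes / len(nearest)


def classify(
    query: Vector,
    index: Index,
    fallback: Callable[[Vector], str],
    k: int = 3,
    min_agreement: float = 0.6,  # hypothetical confidence threshold
) -> Tuple[str, str]:
    """Use the fast KNN path when neighbours agree, otherwise defer to the fallback."""
    label, agreement = knn_vote(query, index, k)
    if agreement >= min_agreement:
        return label, "knn"
    return fallback(query), "fallback"


if __name__ == "__main__":
    index = [([1.0, 0.0], "invoice"), ([0.9, 0.1], "invoice"),
             ([0.0, 1.0], "receipt"), ([0.1, 0.9], "receipt")]
    print(classify([1.0, 0.05], index, fallback=lambda v: "manual_review"))
    print(classify([1.0, 1.0], index, fallback=lambda v: "manual_review", k=2))
```

The design choice illustrated here is that the cheap path handles clear cases and only ambiguous documents pay for the slower path, which is one way a hybrid scheme could keep the pipeline responsive.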

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of inference and orchestration services could be tested on other high-volume AI workloads that mix vision and language models.
Measuring cost per document under variable cloud pricing would show whether the asynchronous and independent-scaling choices reduce expenses in practice.
Repeating the batch profiling after swapping in newer OCR models would test whether the dominance of OCR over LLM steps persists.
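Repeating the profiling requires per-stage timings. A minimal sketch of how such a breakdown could be collected is shown below, assuming a simple in-process profiler. `StageProfiler` and the stage names are hypothetical, and the durations in the usage are placeholders, not measurements from the paper.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator


class StageProfiler:
    """Accumulates wall-clock seconds per pipeline stage across a batch."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)

    def record(self, stage: str, seconds: float) -> None:
        self.totals[stage] += seconds

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time the enclosed block and add it to the named stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, time.perf_counter() - start)

    def shares(self) -> Dict[str, float]:
        """Fraction of total recorded time spent in each stage."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: seconds / total for name, seconds in self.totals.items()}

    def dominant(self) -> str:
        """Stage with the largest accumulated time (raises ValueError if empty)."""
        return max(self.totals, key=self.totals.get)


if __name__ == "__main__":
    profiler = StageProfiler()
    # Placeholder durations for illustration only.
    profiler.record("ocr", 3.0)
    profiler.record("llm", 1.0)
    profiler.record("ocr", 1.0)
    print(profiler.shares(), "dominant:", profiler.dominant())
```

In a real pipeline the `stage` context manager would wrap each call, for example `with profiler.stage("ocr"): ...`, and the shares would be compared before and after swapping the OCR model.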

Load-bearing premise

The described design decisions around task separation, asynchronous processing, and independent scaling will continue to produce the same latency and concurrency behavior for document types and hardware setups different from the ones tested.

What would settle it

Deploy the same pipeline on a new collection of documents or on different hardware. If end-to-end latency is no longer dominated by OCR, or if adding workers raises throughput beyond the GPU limit, the reported findings would be challenged.
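The expectation such an experiment tests against can be written as an idealized bound. The sketch below is an assumption-laden model, not a result from the paper: it ignores queueing effects, batching, and variance, and every parameter name and value is hypothetical. Each worker handles one document at a time, and the shared GPU service has a fixed number of slots.

```python
import math


def throughput_bound(
    workers: int,
    worker_seconds_per_doc: float,
    gpu_slots: int,
    gpu_seconds_per_doc: float,
) -> float:
    """Upper bound on documents per second under the idealized model."""
    worker_limit = workers / worker_seconds_per_doc  # what the workers could push
    gpu_limit = gpu_slots / gpu_seconds_per_doc      # what the shared GPU can absorb
    return min(worker_limit, gpu_limit)


def saturation_point(
    worker_seconds_per_doc: float,
    gpu_slots: int,
    gpu_seconds_per_doc: float,
) -> int:
    """Smallest worker count at which the GPU limit, not the worker count, binds."""
    return math.ceil(gpu_slots * worker_seconds_per_doc / gpu_seconds_per_doc)


if __name__ == "__main__":
    # Hypothetical values: 4 s of total work per document, 2 s of it on the GPU, 2 GPU slots.
    for n in (1, 2, 4, 8, 16):
        print(n, "workers ->", throughput_bound(n, 4.0, 2, 2.0), "docs/s")
    print("saturates at", saturation_point(4.0, 2, 2.0), "workers")
```

Measured throughput that keeps climbing well past the predicted saturation point would suggest that something other than shared GPU capacity is the limiting resource.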

Figures

Figures reproduced from arXiv: 2605.18818 by Benjamin Bengfort, Ben Johnson, Devon Slonaker, Joyce Rigelo, Michael Wharton, Patrick Deziel, Prema Roman, Steve Kramer, Steve Veldman, Vahid Eyorokon, Yao Fehlis, Zhangzhang Si.

**Figure 1.** System architecture. The Gateway accepts submissions, persists page images in object storage and tracking...

**Figure 2.** Hybrid classification strategy. CLIP-KNN classi...

**Figure 3.** Schematic saturation behavior. Throughput...

**Figure 4.** Tiered bottleneck progression. Worker pods...

Original abstract

Academic research tends to focus on new models for document understanding, creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction, as well as our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including a hybrid classification strategy, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a practical engineering report on microservice patterns for document AI pipelines with some useful production observations, but it stays incremental and lacks strong quantitative backing.

This paper is mainly a report on how to run document AI pipelines in production using a microservice architecture. The authors combine classification, OCR, and LLM extraction and share what they learned from processing thousands of multi-page documents per hour. They do well in spelling out practical design decisions like using hybrid classification, separating GPU-bound inference from CPU orchestration, handling IO with async processing, and scaling horizontally. The two profiled findings are worth noting: OCR dominates the latency, and the bottleneck is shared GPU capacity rather than the number of workers. These observations come from actual production runs and can save others from similar surprises.

The soft spots are minor but real. The work is light on quantitative benchmarks, baselines, or error rates, so it's hard to measure the improvement over other setups. The findings are tied to their specific documents and hardware, which limits how far they generalize. No formal proofs or new algorithms here, just applied experience.

This is for industry practitioners building high-throughput document systems. Academic readers might find it less central unless they're interested in deployment issues. It deserves a serious referee because it fills a gap between model papers and real-world use, with honest reporting of what worked in their case. I recommend sending it to peer review, but frame it as an engineering case study.

Referee Report

2 major / 2 minor

Summary. The paper presents a microservice architecture for production document AI pipelines that combine hybrid document classification, OCR, and LLM-based structured field extraction. It details design choices such as separating GPU-bound inference from CPU-bound orchestration, asynchronous handling of IO-bound steps, and independent horizontal scaling. The authors report operational experience processing thousands of multi-page documents per hour and describe two qualitative findings from batch profiling: OCR dominates end-to-end latency, and saturation is governed by shared GPU capacity rather than worker count.

Significance. If the reported observations hold, the work supplies concrete, practitioner-oriented architectural patterns that address the well-known gap between model-centric research and scalable deployment of document understanding systems. The emphasis on real throughput, GPU/CPU separation, and async processing constitutes a useful contribution for applied AI venues, particularly when accompanied by reproducible code or detailed metrics.

major comments (2)

[Abstract and batch profiling section] Abstract and the section describing batch profiling: the two central qualitative findings (OCR latency dominance and GPU-limited saturation) are stated without any accompanying quantitative values, such as measured per-document latencies, throughput in docs/hour, GPU utilization percentages, or comparisons against a baseline monolithic deployment. These omissions make it difficult to assess the magnitude or reproducibility of the observations that motivate the architectural recommendations.
[Production experience section] The section on production experience: the claim of successful operation at thousands of multi-page documents per hour is presented without error rates, extraction accuracy figures, or resource-consumption breakdowns that would demonstrate the load-bearing impact of the hybrid classification and scaling decisions.

minor comments (2)

[Design decisions section] Add a simple architecture diagram or pseudocode snippet illustrating the hybrid classification step and the GPU/CPU separation to improve clarity for readers unfamiliar with the deployment.
[Abstract] The abstract states the goal of providing 'concrete architectural patterns'; consider adding a short table summarizing the four primary design decisions and their observed effects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We have revised the manuscript to incorporate quantitative details addressing the concerns about missing metrics while preserving the paper's focus on architectural patterns and operational insights.

Referee: [Abstract and batch profiling section] Abstract and the section describing batch profiling: the two central qualitative findings (OCR latency dominance and GPU-limited saturation) are stated without any accompanying quantitative values, such as measured per-document latencies, throughput in docs/hour, GPU utilization percentages, or comparisons against a baseline monolithic deployment. These omissions make it difficult to assess the magnitude or reproducibility of the observations that motivate the architectural recommendations.

Authors: We agree that quantitative values strengthen the presentation of our findings. In the revised version, we have added specific metrics from our batch profiling runs, including average per-document latencies for OCR versus LLM steps, observed throughput in documents per hour, GPU utilization percentages at saturation points, and a discussion of how these compare to expectations from a monolithic deployment. These additions are integrated into the batch profiling section and referenced in the abstract to better support the architectural recommendations.

revision: yes

Referee: [Production experience section] The section on production experience: the claim of successful operation at thousands of multi-page documents per hour is presented without error rates, extraction accuracy figures, or resource-consumption breakdowns that would demonstrate the load-bearing impact of the hybrid classification and scaling decisions.

Authors: We acknowledge that additional metrics would better illustrate the impact of our design choices. The revised production experience section now includes aggregate extraction accuracy figures from our deployment, as well as resource-consumption breakdowns (CPU, GPU, and memory utilization) under sustained load. We have also clarified the role of hybrid classification and independent scaling in achieving the reported throughput. Detailed per-component error rates are not provided because they are highly dependent on document quality and specific model versions; we instead report overall system-level reliability metrics.

revision: partial

Circularity Check

0 steps flagged

No significant circularity in applied engineering report

The paper is an applied engineering report that describes a microservice architecture for document AI pipelines (classification, OCR, LLM extraction) and shares qualitative observations from production runs at thousands of multi-page documents per hour. It details concrete design decisions such as hybrid classification, GPU/CPU separation, asynchronous processing, and horizontal scaling, plus two profiled findings on latency and saturation. There are no mathematical derivations, equations, fitted parameters, predictions, or self-referential claims that reduce to prior assumptions or inputs within the paper. The central contribution consists of deployment-specific architectural patterns and empirical observations presented as tied to the authors' environment, with no load-bearing steps relying on self-citation chains or definitional equivalences. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard software engineering assumptions about microservices and workload separation without introducing fitted parameters, new entities, or unstated mathematical axioms.

axioms (1)

domain assumption: Microservice architectures with separation of GPU and CPU workloads are suitable for scaling document processing pipelines. Invoked as the basis for the primary design decisions without derivation in the paper.


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 4 internal anchors

[1] Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Nikolaos Livathinos, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Fabian Lindlbauer, Kasper Dinkla, Lokesh Mishra, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, and Peter W. J. Staar. Docling technical report. arXiv preprin...

[2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.

[3] Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418.

[4] Lilu Cheng, Jingjun Lu, Yi Xuan Chan, Quoc Khai Nguyen, John Bi, and Sean Ho. A hybrid architecture for multi-stage claim document understanding: Combining vision-language models and machine learning for real-time processing. arXiv preprint arXiv:2601.01897.

[5] Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, and Haoshuang Wang. PP-OCR: A practical ultra lightweight OCR system. arXiv preprint arXiv:2009.09941.

[6] Jeffrey Dunn. Introducing FBLearner Flow: Facebook’s AI backbone. Meta Engineering Blog, https://engineering.fb.com/2016/05/09/core-infra/introducing-fblearner-flow-facebook-s-ai-backbone/.

[7] Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449.

[8] Yao Fehlis, Zhangzhang Si, Steve Kramer, Analysa Gonzales, and Michael Wharton. Practical guide on document understanding: From OCR to VLM. https://doi.org/10.5281/zenodo.18020024, Dec.

[9] Google DeepMind. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

[10] Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi, Spencer Romo, Akhil Nooney, David Kaleko, Boyi Xie, Bob Strahan, and Diego A. Socolinsky. IDP accelerator: Agentic document intelligence from extraction to compliance validation. arXiv preprint arXiv:2602.23481.

[11] Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, and Ron Litman. DocVLM: Make your VLM an efficient reader. arXiv preprint arXiv:2412.08746.

[12] URL https://www.ricoh-usa.com/en/products/pd/equipment/scanners/fi-8950-production-scanner. Accessed: 2026-05-11. Isaac Robinson, Peter Robicheaux, Matvei Popov, Deva Ramanan, and Neehar Peri. RF-DETR: Neural architecture search for real-time detection transformers. arXiv preprint arXiv:2511.09554.

[13] Alexandre Sallinen, Stefan Krsteski, Paul Teiletche, Marc-Antoine Allard, Baptiste Lecoeur, Michael Zhang, Fabrice Nemo, David Kalajdzic, Matthias Meyer, and Mary-Anne Hartley. MMORE: Massive multimodal open RAG & extraction. arXiv preprint arXiv:2509.11937.

[14] Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, and Daniel Dahlmeier. OCR or not? Rethinking document information extraction in the MLLMs era with real-world large-scale datasets. arXiv preprint arXiv:2603.02789.

[15] Zilong Wang and Xiaoyu Shen. Hybrid OCR-LLM framework for enterprise-scale document information extraction under copy-heavy task. arXiv preprint arXiv:2510.10138.
