arxiv: 2605.18818 · v1 · pith:XUSEJK3Vnew · submitted 2026-05-12 · 💻 cs.AI · cs.LG · cs.SE

Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

Pith reviewed 2026-05-20 21:48 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.SE
keywords document AI · microservice architecture · OCR · large language models · production pipelines · GPU inference · asynchronous processing · horizontal scaling

The pith

A microservice architecture runs OCR and LLM document pipelines in production at thousands of multi-page documents per hour.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a microservice design that chains together classification, optical character recognition, and large language model steps to extract structured fields from real documents. It explains concrete choices such as keeping GPU model calls separate from CPU orchestration, running input-output work asynchronously, and scaling each service on its own. The authors describe operating the system at thousands of multi-page documents per hour and report two observations from batch profiling: OCR steps take longer than the language-model extraction, and overall throughput is capped by shared GPU capacity rather than by the number of worker processes. The goal is to give engineers reusable patterns that move document AI past research benchmarks into reliable daily use.

Core claim

A microservice architecture that encapsulates pipelines of classification, optical character recognition, and large language model structured field extraction, together with hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, asynchronous processing for IO-bound operations, and independent horizontal scaling, supports reliable production operation at the scale of thousands of multi-page documents per hour, where profiling reveals that OCR dominates end-to-end latency and that concurrency is limited by shared GPU-inference capacity rather than worker count.

What carries the argument

Microservice architecture that orchestrates hybrid classification, GPU inference for OCR and LLMs, asynchronous CPU tasks, and independent horizontal scaling of components.
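
The saturation claim can be illustrated with a toy simulation. This is a minimal sketch, not the authors' implementation: it assumes the shared GPU inference service can be modeled as an `asyncio.Semaphore` with a fixed number of slots, and every name and constant in it (`GPU_SLOTS`, `OCR_SECONDS`, `LLM_SECONDS`, `IO_SECONDS`, `worker`, `run`, `sweep`) is hypothetical. Sleeps stand in for real OCR, LLM, and storage calls.

```python
import asyncio
import time

# All values below are illustrative assumptions, not measurements from the paper.
GPU_SLOTS = 2        # concurrent requests the shared GPU service can serve
OCR_SECONDS = 0.02   # GPU-bound OCR time per document
LLM_SECONDS = 0.01   # GPU-bound field-extraction time per document
IO_SECONDS = 0.005   # IO-bound time (object storage reads, database writes)


async def gpu_call(gpu: asyncio.Semaphore, seconds: float) -> None:
    """Stand-in for a request to the shared GPU inference service."""
    async with gpu:
        await asyncio.sleep(seconds)


async def worker(queue: asyncio.Queue, gpu: asyncio.Semaphore) -> None:
    """CPU-side orchestration: pull a document, do IO, call the GPU stages."""
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        await asyncio.sleep(IO_SECONDS)    # fetch page images
        await gpu_call(gpu, OCR_SECONDS)   # OCR stage
        await gpu_call(gpu, LLM_SECONDS)   # LLM structured extraction
        await asyncio.sleep(IO_SECONDS)    # persist results


async def run(num_workers: int, num_docs: int = 24) -> float:
    """Return simulated throughput in documents per second."""
    queue: asyncio.Queue = asyncio.Queue()
    for doc_id in range(num_docs):
        queue.put_nowait(doc_id)
    gpu = asyncio.Semaphore(GPU_SLOTS)
    start = time.perf_counter()
    await asyncio.gather(*(worker(queue, gpu) for _ in range(num_workers)))
    return num_docs / (time.perf_counter() - start)


def sweep(worker_counts=(1, 2, 4, 8, 16)) -> dict:
    """Measure simulated throughput for several worker counts."""
    return {n: asyncio.run(run(n)) for n in worker_counts}


if __name__ == "__main__":
    for n, rate in sweep().items():
        print(f"{n:>2} workers: {rate:6.1f} docs/s")
```

In this model throughput cannot exceed `GPU_SLOTS / (OCR_SECONDS + LLM_SECONDS)`, however many workers are started, because every document must hold a GPU slot for that long. Adding workers helps only until the semaphore is continuously occupied, which mirrors the qualitative finding that concurrency is capped by shared inference capacity rather than by worker count.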

If this is right

  • Teams should profile pipelines to confirm that OCR accounts for most latency and direct optimization effort there rather than to the language-model stage.
  • Resource allocation should prioritize GPU capacity over additional CPU workers because concurrency saturates at the shared inference hardware.
  • Separating GPU inference services from CPU orchestration services allows each to scale at its own rate without blocking the other.
  • Hybrid classification can route documents to the most suitable downstream models while keeping the overall pipeline responsive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of inference and orchestration services could be tested on other high-volume AI workloads that mix vision and language models.
  • Measuring cost per document under variable cloud pricing would show whether the asynchronous and independent-scaling choices reduce expenses in practice.
  • Repeating the batch profiling after swapping in newer OCR models would test whether the dominance of OCR over LLM steps persists.
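
Repeating the profiling needs only per-stage wall-clock totals. The sketch below is one hedged way to collect them, not the tooling used in the paper: `StageProfiler` and `profile_batch` are hypothetical helpers, and the stage callables are placeholders where real OCR and LLM clients would go.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values())
        return {n: t / total for n, t in self.totals.items()} if total else {}

    def dominant(self):
        """Name of the stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


def profile_batch(stages, documents):
    """Run every stage on every document and return the filled profiler.

    `stages` maps a stage name to a callable that takes one document.
    """
    profiler = StageProfiler()
    for doc in documents:
        for name, fn in stages.items():
            with profiler.stage(name):
                fn(doc)
    return profiler


if __name__ == "__main__":
    # Placeholder stages: the sleeps are arbitrary stand-ins, not measured values.
    fake_stages = {
        "ocr": lambda doc: time.sleep(0.03),
        "llm_extraction": lambda doc: time.sleep(0.01),
    }
    result = profile_batch(fake_stages, documents=range(5))
    for name, share in result.shares().items():
        print(f"{name}: {share:.0%} of measured time")
    print("dominant stage:", result.dominant())
```

Running the same harness before and after swapping the OCR model would show directly whether the dominant stage changes.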

Load-bearing premise

The described design decisions around task separation, asynchronous processing, and independent scaling will continue to produce the same latency and concurrency behavior for document types and hardware setups different from the ones tested.

What would settle it

The reported findings would be challenged if the same pipeline, deployed on a new document collection or on different hardware, showed end-to-end latency that is no longer dominated by OCR, or throughput that keeps rising past the GPU limit as workers are added.
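
The GPU limit in this test can be made concrete with a standard bottleneck bound. The model below is an assumption introduced here for illustration, not a formula from the paper: $W$ workers each process one document at a time, the shared inference service offers $G$ concurrent GPU slots, and each document needs $t_{\mathrm{io}}$ of IO time plus $t_{\mathrm{ocr}}$ and $t_{\mathrm{llm}}$ of GPU time.

```latex
\[
X(W) \;\le\; \min\!\left(\frac{W}{T},\; \frac{G}{t_{\mathrm{gpu}}}\right),
\qquad
T = t_{\mathrm{io}} + t_{\mathrm{gpu}},
\qquad
t_{\mathrm{gpu}} = t_{\mathrm{ocr}} + t_{\mathrm{llm}}
\]
\[
W^{*} = \frac{G\,T}{t_{\mathrm{gpu}}}
\]
```

Here $X(W)$ is throughput in documents per unit time and $W^{*}$ is the worker count at which the two bounds meet. Under these assumptions, measured throughput that exceeds $G / t_{\mathrm{gpu}}$ after adding workers would contradict the GPU-limited reading, while a plateau near that value for $W > W^{*}$ would support it.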

Figures

Figures reproduced from arXiv: 2605.18818 by Benjamin Bengfort, Ben Johnson, Devon Slonaker, Joyce Rigelo, Michael Wharton, Patrick Deziel, Prema Roman, Steve Kramer, Steve Veldman, Vahid Eyorokon, Yao Fehlis, Zhangzhang Si.

Figure 1: System architecture. The Gateway accepts submissions, persists page images in object storage and tracking …
Figure 2: Hybrid classification strategy. CLIP-KNN classi…
Figure 3: Schematic saturation behavior. Throughput …
Figure 4: Tiered bottleneck progression. Worker pods …

Original abstract

Academic research tends to focus on new models for document understanding creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction as well as our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including a hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark; effectively operationalizing models in production.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a microservice architecture for production document AI pipelines that combine hybrid document classification, OCR, and LLM-based structured field extraction. It details design choices such as separating GPU-bound inference from CPU-bound orchestration, asynchronous handling of IO-bound steps, and independent horizontal scaling. The authors report operational experience processing thousands of multi-page documents per hour and describe two qualitative findings from batch profiling: OCR dominates end-to-end latency, and saturation is governed by shared GPU capacity rather than worker count.

Significance. If the reported observations hold, the work supplies concrete, practitioner-oriented architectural patterns that address the well-known gap between model-centric research and scalable deployment of document understanding systems. The emphasis on real throughput, GPU/CPU separation, and async processing constitutes a useful contribution for applied AI venues, particularly when accompanied by reproducible code or detailed metrics.

major comments (2)
  1. [Abstract and batch profiling section] Abstract and the section describing batch profiling: the two central qualitative findings (OCR latency dominance and GPU-limited saturation) are stated without any accompanying quantitative values, such as measured per-document latencies, throughput in docs/hour, GPU utilization percentages, or comparisons against a baseline monolithic deployment. These omissions make it difficult to assess the magnitude or reproducibility of the observations that motivate the architectural recommendations.
  2. [Production experience section] The section on production experience: the claim of successful operation at thousands of multi-page documents per hour is presented without error rates, extraction accuracy figures, or resource-consumption breakdowns that would demonstrate the load-bearing impact of the hybrid classification and scaling decisions.
minor comments (2)
  1. [Design decisions section] Add a simple architecture diagram or pseudocode snippet illustrating the hybrid classification step and the GPU/CPU separation to improve clarity for readers unfamiliar with the deployment.
  2. [Abstract] The abstract states the goal of providing 'concrete architectural patterns'; consider adding a short table summarizing the four primary design decisions and their observed effects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We have revised the manuscript to incorporate quantitative details addressing the concerns about missing metrics while preserving the paper's focus on architectural patterns and operational insights.

  1. Referee: [Abstract and batch profiling section] Abstract and the section describing batch profiling: the two central qualitative findings (OCR latency dominance and GPU-limited saturation) are stated without any accompanying quantitative values, such as measured per-document latencies, throughput in docs/hour, GPU utilization percentages, or comparisons against a baseline monolithic deployment. These omissions make it difficult to assess the magnitude or reproducibility of the observations that motivate the architectural recommendations.

    Authors: We agree that quantitative values strengthen the presentation of our findings. In the revised version, we have added specific metrics from our batch profiling runs, including average per-document latencies for OCR versus LLM steps, observed throughput in documents per hour, GPU utilization percentages at saturation points, and a discussion of how these compare to expectations from a monolithic deployment. These additions are integrated into the batch profiling section and referenced in the abstract to better support the architectural recommendations.
    revision: yes

  2. Referee: [Production experience section] The section on production experience: the claim of successful operation at thousands of multi-page documents per hour is presented without error rates, extraction accuracy figures, or resource-consumption breakdowns that would demonstrate the load-bearing impact of the hybrid classification and scaling decisions.

    Authors: We acknowledge that additional metrics would better illustrate the impact of our design choices. The revised production experience section now includes aggregate extraction accuracy figures from our deployment, as well as resource-consumption breakdowns (CPU, GPU, and memory utilization) under sustained load. We have also clarified the role of hybrid classification and independent scaling in achieving the reported throughput. Detailed per-component error rates are not provided because they are highly dependent on document quality and specific model versions; we instead report overall system-level reliability metrics.
    revision: partial

Circularity Check

0 steps flagged

No significant circularity in applied engineering report

The paper is an applied engineering report that describes a microservice architecture for document AI pipelines (classification, OCR, LLM extraction) and shares qualitative observations from production runs at thousands of multi-page documents per hour. It details concrete design decisions such as hybrid classification, GPU/CPU separation, asynchronous processing, and horizontal scaling, plus two profiled findings on latency and saturation. There are no mathematical derivations, equations, fitted parameters, predictions, or self-referential claims that reduce to prior assumptions or inputs within the paper. The central contribution consists of deployment-specific architectural patterns and empirical observations presented as tied to the authors' environment, with no load-bearing steps relying on self-citation chains or definitional equivalences. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard software engineering assumptions about microservices and workload separation without introducing fitted parameters, new entities, or unstated mathematical axioms.

axioms (1)
  • domain assumption: Microservice architectures with separation of GPU and CPU workloads are suitable for scaling document processing pipelines.
    Invoked as the basis for the primary design decisions without derivation in the paper.

pith-pipeline@v0.9.0 · 5748 in / 1222 out tokens · 47151 ms · 2026-05-20T21:48:51.771793+00:00 · methodology
