arxiv: 2605.17159 · v1 · pith:774MOTUSnew · submitted 2026-05-16 · 💻 cs.AI · cs.MA

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

Pith reviewed 2026-05-20 14:20 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords multi-agent systems · document processing · human-in-the-loop · sustainability · invoice automation · large language models · AI agents · enterprise AI

The pith

Multi-agent AI pipeline with selective human validation automates 97% of document processing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce MADP, a pipeline that deploys five AI agents to classify, split, parse, extract from, and validate documents, calling on humans only for difficult cases. In a production deployment on 955 real invoices and related documents, this setup automated 97% of the flow, and it reached 98.5% document-level accuracy on a stratified 100-document test subset. For a workload of 100,000 invoices per year, the approach is projected to cut staff time by about 70% while lowering the carbon, energy, and water costs of the work by 63 to 69 percent relative to manual processing. A reader would care because it demonstrates a working hybrid system that scales enterprise document handling without the cost of fully manual effort or the risks of full automation.

Core claim

MADP integrates a Classificator, Splitter, Parser, Extraction, and Validator agent with a Human-in-the-Loop mechanism and Prompt Fine Tuning with Feedback Inheritance to process documents. Production use on 955 documents yielded a 97.0% automation rate with 3% non-AI fallback, while ablation on 100 documents showed 98.5% accuracy. For 100,000 invoices yearly, the system is projected to offer a roughly 70% FTE reduction and 69% CO2, 69% energy, and 63% water savings over manual methods.

What carries the argument

The five-agent architecture (Classificator, Splitter, Parser, Extraction, Validator) combined with Human-in-the-Loop supervision and the Prompt Fine Tuning with Feedback Inheritance (PFTFI) method for refining LLM prompts based on feedback.
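The sequential hand-off between the five agents, with escalation to a human for difficult cases, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Document` fields, the `make_stage` placeholder stages, the fixed confidence values, and the `review_threshold` routing rule are all assumptions introduced here; only the five stage names come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    doc_id: str
    payload: dict = field(default_factory=dict)
    confidence: float = 1.0   # lowest confidence reported by any stage so far
    needs_human: bool = False


Stage = Callable[[Document], Document]

# Stage names taken from the paper; everything else below is hypothetical.
STAGE_NAMES = ["Classificator", "Splitter", "Parser", "Extraction", "Validator"]


def make_stage(name: str, confidence: float) -> Stage:
    """Build a placeholder stage that logs its name and reports a fixed confidence."""
    def stage(doc: Document) -> Document:
        doc.payload.setdefault("trace", []).append(name)
        doc.confidence = min(doc.confidence, confidence)
        return doc
    return stage


def run_pipeline(doc: Document, stages: List[Stage], review_threshold: float = 0.8) -> Document:
    """Run every stage in order, then flag the document for human review if
    the weakest stage confidence falls below the (assumed) threshold."""
    for stage in stages:
        doc = stage(doc)
    doc.needs_human = doc.confidence < review_threshold
    return doc


stages = [make_stage(name, 0.95) for name in STAGE_NAMES]
easy = run_pipeline(Document("inv-001"), stages)

# Same pipeline, but the Extraction stage is unsure about this document.
hard_stages = stages[:3] + [make_stage("Extraction", 0.55)] + stages[4:]
hard = run_pipeline(Document("inv-002"), hard_stages)

print(easy.needs_human, hard.needs_human)
```

In this sketch the first document passes straight through, while the second is routed to a reviewer because one stage reported low confidence. How MADP actually decides which documents reach a human is not specified in this summary.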

If this is right

  • A projected reduction of approximately 70% in full-time equivalent (FTE) requirements for annual processing of 100,000 invoices.
  • 69% lower CO2 emissions, 69% lower energy consumption, and 63% lower water usage compared to traditional manual processing.
  • 97.0% full-pipeline automation rate achieved in production deployment on 955 real-world documents.
  • 98.5% document-level accuracy with the full MADP configuration including human supervision.
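To make the scale of these figures concrete, the arithmetic below projects the 3% fallback rate onto the 100,000-invoice workload and back-calculates the manual baselines implied by the reported reductions. The hybrid values of 7 FTEs and 5.4 tons of CO2 per year are taken from the Figure 5 caption. The implied manual baselines are a back-of-the-envelope derivation, not numbers reported in this summary, and they assume the percentages are measured against the manual scenario at the same workload.

```python
# Values quoted in the summary and in the Figure 5 caption.
ANNUAL_INVOICES = 100_000
FALLBACK_RATE = 0.03       # share of documents needing non-AI fallback
FTE_REDUCTION = 0.70       # approximate, versus manual processing
CO2_REDUCTION = 0.69       # versus manual processing
HYBRID_FTES = 7            # AI+HITL scenario
HYBRID_CO2_TONS = 5.4      # AI+HITL scenario, tons per year

# Documents per year that would still need non-AI handling.
fallback_docs = round(ANNUAL_INVOICES * FALLBACK_RATE)

# Implied manual baselines (an assumption-laden back-calculation):
# hybrid = manual * (1 - reduction)  =>  manual = hybrid / (1 - reduction)
implied_manual_ftes = HYBRID_FTES / (1 - FTE_REDUCTION)
implied_manual_co2_tons = HYBRID_CO2_TONS / (1 - CO2_REDUCTION)

print(f"fallback documents per year: {fallback_docs}")
print(f"implied manual FTEs: {implied_manual_ftes:.1f}")
print(f"implied manual CO2 (tons/year): {implied_manual_co2_tons:.1f}")
```

Because the 70% figure is described as approximate, the implied manual headcount of about 23 FTEs should be read as an order-of-magnitude check rather than a reported result.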

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-agent setups with human oversight could apply to other document-heavy domains like legal contracts or medical records.
  • Over time, the feedback inheritance mechanism may allow the system to reduce human involvement even further.
  • The reported sustainability gains might encourage companies to prioritize hybrid AI systems for both cost and environmental reasons.

Load-bearing premise

The 100-document stratified ablation set represents the full range of the 100,000 annual invoices and human interventions stay at 3% without causing operational delays.

What would settle it

Processing an additional set of 1,000 invoices from new suppliers and measuring the actual percentage requiring human review, the resulting accuracy, and the real-world time and resource savings.

Figures

Figures reproduced from arXiv: 2605.17159 by Diego Gosmar, Giovanni Zenezini.

Figure 1. MADP pipeline: five sequential components (Classificator, Splitter, Parser, Extraction, Validator) with PFTFI feedback loop.
Adjacent text, 3.2 Classificator Agent: The Classificator Agent uses a Convolutional Neural Network trained to identify document types and supplier categories. For invoice processing, the CNN classifies documents by supplier, enabling supplier-specific extraction templates. The CNN architecture uses…

Figure 2. PFTFI mechanism: human corrections update extraction prompts and propagate to pending documents without model retraining.
Adjacent text, 3.7 Human Validation Interface: The validation interface presents each document side by side with the corresponding extracted metadata in structured JSON format, enabling efficient human inspection and correction. The design highlights fields with low confidence scores, provides compact…

Figure 3. Ablation study: incremental accuracy gains per component. Parser Agent contributes the largest improvement (+17.5 pp).

Figure 4. LLM backend comparison: Mistral-Small-3.2 leads in F1 (92.9%) and recall (98.2%); DeepSeek-OCR leads in precision (96.8%).
Adjacent text, 4.5 Operational Efficiency Analysis: To assess the scalability and economic viability of MADP, we project operational metrics for a representative enterprise use-case scenario of 100,000 invoices per year, based on the observed performance characteristics from the 955-document productio…

Figure 5. Three-scenario comparison: AI+HITL achieves best accuracy (98.5%), lowest CO2 (5.4 tons/year), with 7 FTEs.
Adjacent text, 5 Sustainability Impact Analysis: We present the first comprehensive sustainability analysis of AI-assisted document processing, quantifying environmental metrics across three scenarios and applying the use-case scenario of 100,000 invoices per year to quantify environmental impact at enterprise scale…
Original abstract

Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor-intensive and error-prone. We present MADP, a multi-agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning-based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation. Our system integrates five specialized agents--Classificator, Splitter, Parser, Extraction, and Validator--with a Human-in-the-Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%. Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate, with only 3% requiring non-AI fallback. Ablation evaluation on a stratified 100-document subset (5 documents per each of 20 supplier/document-type categories) demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) provide practical insights for deployment in production environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MADP, a multi-agent pipeline for enterprise document processing that combines deep learning-based classification and parsing with LLM-based extraction, five specialized agents (Classificator, Splitter, Parser, Extraction, Validator), a Human-in-the-Loop mechanism, and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. It reports empirical results from production deployment on 955 real-world documents achieving 97.0% full-pipeline automation (3% non-AI fallback), an ablation study on a stratified 100-document subset (5 documents per 20 supplier/document-type categories) attaining 98.5% document-level accuracy with HITL, a projected ~70% FTE reduction for an annual workload of 100,000 invoices, and sustainability gains (69% CO2, 69% energy, 63% water reduction vs. manual processing), plus benchmarks across LLM backends.

Significance. If the performance and sustainability claims can be substantiated with statistical rigor and workload representativeness, the work would offer a concrete case study of hybrid AI+HITL systems for high-volume document automation, with practical value in the reported production metrics, multi-LLM comparisons, and environmental impact analysis.

major comments (2)
  1. [Ablation evaluation (abstract and results section)] The claim that the stratified 100-document subset supports 98.5% accuracy and generalizes the 3% fallback rate to the full 100,000-invoice annual workload is load-bearing, yet no evidence is given that the 20 supplier/document-type categories match the empirical distribution in the production logs or that the 955 documents were randomly sampled rather than curated; this directly weakens the 70% FTE reduction projection.
  2. [Production deployment and sustainability analysis (abstract)] The reported 97.0% automation rate, 98.5% accuracy, and 69%/69%/63% sustainability reductions are presented without error bars, statistical significance tests, or detailed measurement protocols for accuracy and resource calculations, making the central empirical claims difficult to assess for reliability.
minor comments (2)
  1. [Introduction and method description] The PFTFI approach is introduced as novel but would benefit from an earlier and more explicit definition of its feedback inheritance mechanism and of how it differs from standard prompt tuning.
  2. [Results and tables] Any tables or figures presenting ablation or production metrics should include confidence intervals or variance measures to allow proper interpretation of the 98.5% accuracy figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the empirical presentation of our work. We address each major comment below and have made revisions to incorporate additional details on sampling, representativeness, statistical measures, and measurement protocols.

Point-by-point responses
  1. Referee: Ablation evaluation (abstract and results section): The claim that the stratified 100-document subset supports 98.5% accuracy and generalizes the 3% fallback rate to the full 100,000-invoice annual workload is load-bearing, yet no evidence is given that the 20 supplier/document-type categories match the empirical distribution in the production logs or that the 955 documents were randomly sampled rather than curated; this directly weakens the 70% FTE reduction projection.

    Authors: We acknowledge the need for greater transparency on sampling and category selection. The 20 categories were derived from frequency analysis of production logs to capture the dominant supplier and document-type combinations observed in the operational data. The 955 documents constitute a consecutive production sample from January 2026 rather than a curated selection. We will revise the methods and results sections to explicitly describe this process, add a supplementary table summarizing category frequencies from the logs, and clarify that the FTE projection extrapolates the observed 97% automation rate from the full production deployment to the annual workload of 100,000 invoices. The ablation study provides supporting accuracy validation under the stratified design. We will also add a limitations paragraph noting that while the stratification targets representativeness, exact distributional matching across all possible categories would require further data collection. revision: yes

  2. Referee: Production deployment and sustainability analysis (abstract): The reported 97.0% automation rate, 98.5% accuracy, and 69%/69%/63% sustainability reductions are presented without error bars, statistical significance tests, or detailed measurement protocols for accuracy and resource calculations, making the central empirical claims difficult to assess for reliability.

    Authors: We agree that including statistical rigor and detailed protocols will improve assessability. In the revised manuscript we will report binomial confidence intervals for the 97.0% automation rate and 98.5% accuracy figures. We will expand the methods section with explicit protocols: accuracy is measured via document-level agreement against human-validated ground truth, and sustainability metrics are computed using standard energy-consumption models for the deployed hardware together with established emission factors for both compute and manual labor baselines. We will also add a brief discussion of the measurement approach and any assumptions underlying the 69%/69%/63% reductions. revision: yes

Circularity Check

0 steps flagged

No circularity: results are measured empirical outcomes from deployment and ablation

full rationale

The paper reports measured automation rates (97% on 955 production documents), accuracy (98.5% on stratified 100-document ablation), and a projected 70% FTE reduction from operational analysis of the 100k-invoice workload. These are direct observations and comparisons rather than quantities derived from equations that reduce to fitted parameters, self-definitions, or load-bearing self-citations. The multi-agent architecture and PFTFI method are described independently of the evaluation metrics, with no derivation chain that collapses predictions back to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central performance claims rest on the assumption that the chosen 20-category stratification captures production variability and that LLM backends behave consistently across the tested suppliers. No explicit free parameters or new physical entities are introduced.
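One way to probe the stratification assumption is to reweight per-category accuracy by each category's share of production traffic and compare the result with the equal-weight average that a "5 documents per category" design produces. The sketch below uses three invented categories with invented accuracies and shares purely for illustration; none of these numbers come from the paper.

```python
from typing import Dict


def reweighted_accuracy(per_category_accuracy: Dict[str, float],
                        production_share: Dict[str, float]) -> float:
    """Post-stratified accuracy: weight each category by its production share."""
    if per_category_accuracy.keys() != production_share.keys():
        raise ValueError("accuracy and share tables must cover the same categories")
    total_share = sum(production_share.values())
    return sum(per_category_accuracy[c] * production_share[c] / total_share
               for c in per_category_accuracy)


# Hypothetical example data (not from the paper).
accuracy_by_category = {
    "supplier_a_invoice": 1.0,
    "supplier_b_invoice": 1.0,
    "supplier_c_credit_note": 0.8,
}
share_by_category = {
    "supplier_a_invoice": 0.6,
    "supplier_b_invoice": 0.3,
    "supplier_c_credit_note": 0.1,
}

equal_weight = sum(accuracy_by_category.values()) / len(accuracy_by_category)
weighted = reweighted_accuracy(accuracy_by_category, share_by_category)

print(f"equal-weight accuracy: {equal_weight:.3f}")
print(f"production-weighted accuracy: {weighted:.3f}")
```

If the two numbers diverge noticeably on the real category frequencies, the headline accuracy depends on the sampling design and not only on the pipeline.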

axioms (1)
  • domain assumption The 20 supplier/document-type categories used for stratification are representative of the full 100,000-invoice annual workload.
    Ablation study design in abstract selects 5 documents per category.
invented entities (1)
  • PFTFI (Prompt Fine Tuning with Feedback Inheritance) no independent evidence
    purpose: Mechanism to incorporate human corrections into future prompts for improved extraction.
    Presented as a novel component of the MADP pipeline.
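The idea behind PFTFI, as described in the Figure 2 caption, is that human corrections update extraction prompts and carry over to documents still waiting in the queue, with no model retraining. A minimal sketch of that behavior follows. The `PromptStore` class, its method names, the per-supplier keying, and the example correction text are all hypothetical; the paper's actual mechanism may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PromptStore:
    """Holds a base extraction prompt plus per-supplier correction rules."""

    def __init__(self, base_prompt: str) -> None:
        self.base_prompt = base_prompt
        self.rules: Dict[str, List[str]] = defaultdict(list)

    def add_correction(self, supplier: str, rule: str) -> None:
        """Record a human correction as a reusable rule for that supplier."""
        if rule not in self.rules[supplier]:
            self.rules[supplier].append(rule)

    def prompt_for(self, supplier: str) -> str:
        """Base prompt followed by every rule inherited for this supplier."""
        lines = [self.base_prompt] + [f"- {rule}" for rule in self.rules.get(supplier, [])]
        return "\n".join(lines)


BASE = "Extract invoice number, date, and total as JSON."
store = PromptStore(BASE)

# Documents still waiting for extraction: (doc_id, supplier).
pending: List[Tuple[str, str]] = [("inv-2", "acme"), ("inv-3", "globex")]

# A reviewer fixes a field on an earlier "acme" invoice; the fix becomes a rule.
store.add_correction("acme", "The total is the amount including VAT.")

# Pending documents pick up the updated prompt at extraction time.
prompts = {doc_id: store.prompt_for(supplier) for doc_id, supplier in pending}
print(prompts["inv-2"])
```

Keying rules by supplier mirrors the supplier-specific extraction templates mentioned for the Classificator Agent, but that pairing is an assumption made for this example.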

pith-pipeline@v0.9.0 · 5815 in / 1453 out tokens · 45453 ms · 2026-05-20T14:20:50.285861+00:00 · methodology
