MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop
Pith reviewed 2026-05-20 14:20 UTC · model grok-4.3
The pith
Multi-agent AI pipeline with selective human validation automates 97% of document processing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MADP integrates five agents (Classificator, Splitter, Parser, Extraction, and Validator) with a Human-in-the-Loop mechanism and Prompt Fine Tuning with Feedback Inheritance to process documents. Production use on 955 documents yielded a 97.0% full-pipeline automation rate with 3% non-AI fallback, while an ablation on a stratified 100-document subset showed 98.5% document-level accuracy with human supervision. For a workload of 100,000 invoices per year, the operational analysis projects an approximately 70% FTE reduction, and the sustainability analysis reports reductions of 69% in CO2 emissions, 69% in energy consumption, and 63% in water usage relative to manual processing.
What carries the argument
The five-agent architecture (Classificator, Splitter, Parser, Extraction, Validator) combined with Human-in-the-Loop supervision and the Prompt Fine Tuning with Feedback Inheritance (PFTFI) method for refining LLM prompts based on feedback.
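The agent sequence described above can be pictured as a chain of stages followed by a routing decision. The sketch below is a minimal, hypothetical illustration in Python: the stage functions, the `Doc` fields, and the rule that a missing required field triggers human review are all assumptions made for this example, not the paper's implementation (which, per the quoted passages, uses a CNN classifier, the Docling library, and LLM-based extraction).

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical container passed from one stage to the next."""
    raw: str
    doc_type: str = ""
    pages: list[str] = field(default_factory=list)
    parsed: list[str] = field(default_factory=list)
    fields: dict[str, str] = field(default_factory=dict)
    needs_human: bool = False


def classify(doc: Doc) -> Doc:
    # Stand-in for the Classificator agent: a keyword rule, not a CNN.
    doc.doc_type = "invoice" if "invoice" in doc.raw.lower() else "other"
    return doc


def split(doc: Doc) -> Doc:
    # Stand-in for the Splitter agent: split on form-feed page breaks.
    doc.pages = doc.raw.split("\f")
    return doc


def parse(doc: Doc) -> Doc:
    # Stand-in for the Parser agent: normalise whitespace per page.
    doc.parsed = [page.strip() for page in doc.pages]
    return doc


def extract(doc: Doc) -> Doc:
    # Stand-in for the Extraction agent: read "key: value" lines.
    for page in doc.parsed:
        for line in page.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                doc.fields[key.strip().lower()] = value.strip()
    return doc


def validate(doc: Doc, required: tuple[str, ...] = ("total",)) -> Doc:
    # Stand-in for the Validator agent: flag the document for human
    # review when any required field is missing (assumed rule).
    doc.needs_human = any(name not in doc.fields for name in required)
    return doc


def run_pipeline(raw: str) -> Doc:
    doc = Doc(raw)
    for stage in (classify, split, parse, extract, validate):
        doc = stage(doc)
    return doc
```

In this toy version, a document such as `"Invoice\nTotal: 42.00"` passes straight through, while one without a `total` line is marked `needs_human`. The real system's routing criteria are not specified in the material quoted here.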
If this is right
- A projected reduction of approximately 70% in full-time equivalent requirements for annual processing of 100,000 invoices.
- 69% lower CO2 emissions, 69% lower energy consumption, and 63% lower water usage compared to traditional manual processing.
- 97.0% full-pipeline automation rate achieved in production deployment on 955 real-world documents.
- 98.5% document-level accuracy with the full MADP configuration including human supervision.
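To make the scale of these figures concrete, the following sketch converts the reported rates into document counts. The counts are obtained by rounding and are not stated in the paper, and the baseline FTE value is a purely hypothetical placeholder, since the source gives only the relative reduction.

```python
def split_by_automation(total_docs: int, automation_rate: float) -> tuple[int, int]:
    """Return (automated, fallback) document counts for a given rate."""
    automated = round(total_docs * automation_rate)
    return automated, total_docs - automated


def remaining_fte(baseline_fte: float, reduction: float) -> float:
    """FTE still required after a relative reduction (e.g. 0.70)."""
    return baseline_fte * (1.0 - reduction)


if __name__ == "__main__":
    # Production sample reported in the paper: 955 documents at 97.0%.
    print(split_by_automation(955, 0.97))
    # Annual scenario reported in the paper: 100,000 invoices.
    print(split_by_automation(100_000, 0.97))
    # Hypothetical baseline of 10 FTE; only the 70% figure is from the paper.
    print(remaining_fte(10.0, 0.70))
```

Under these assumptions, a 97% rate on 955 documents corresponds to about 926 automated and 29 fallback documents, and the same rate applied to 100,000 invoices leaves about 3,000 per year for fallback handling.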
Where Pith is reading between the lines
- Similar multi-agent setups with human oversight could apply to other document-heavy domains like legal contracts or medical records.
- Over time, the feedback inheritance mechanism may allow the system to reduce human involvement even further.
- The reported sustainability gains might encourage companies to prioritize hybrid AI systems for both cost and environmental reasons.
Load-bearing premise
The 100-document stratified ablation set represents the full range of the 100,000 annual invoices and human interventions stay at 3% without causing operational delays.
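The stratified design behind this premise (5 documents from each of 20 supplier/document-type categories, per the abstract) can be sketched as follows. This is an illustrative sampler only: the record layout, the `category` key, and the fixed seed are assumptions, and the paper does not describe how documents were drawn within each category.

```python
import random
from collections import defaultdict


def stratified_sample(docs, key, per_stratum, seed=0):
    """Draw the same number of documents from every category.

    docs        : iterable of records (any type)
    key         : function mapping a record to its category label
    per_stratum : documents to draw per category (5 in the paper's design)
    seed        : fixed seed so the draw is reproducible (assumption)
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)

    sample = []
    for category in sorted(strata):
        members = strata[category]
        if len(members) < per_stratum:
            raise ValueError(f"category {category!r} has too few documents")
        # Sample without replacement inside the category.
        sample.extend(rng.sample(members, per_stratum))
    return sample
```

Equal allocation gives every category the same weight in the sample regardless of how often it occurs in production, which is why the representativeness of the 20 categories matters for any extrapolation to the annual workload.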
What would settle it
Processing an additional set of 1,000 invoices from new suppliers and measuring the actual percentage requiring human review, the resulting accuracy, and the real-world time and resource savings.
Original abstract
Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor-intensive and error-prone. We present MADP, a multi-agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning-based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation. Our system integrates five specialized agents--Classificator, Splitter, Parser, Extraction, and Validator--with a Human-in-the-Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%. Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate, with only 3% requiring non-AI fallback. Ablation evaluation on a stratified 100-document subset (5 documents per each of 20 supplier/document-type categories) demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) provide practical insights for deployment in production environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MADP, a multi-agent pipeline for enterprise document processing that combines deep learning-based classification and parsing with LLM-based extraction, five specialized agents (Classificator, Splitter, Parser, Extraction, Validator), a Human-in-the-Loop mechanism, and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. It reports empirical results from production deployment on 955 real-world documents achieving 97.0% full-pipeline automation (3% non-AI fallback), an ablation study on a stratified 100-document subset (5 documents per 20 supplier/document-type categories) attaining 98.5% document-level accuracy with HITL, a projected ~70% FTE reduction for an annual workload of 100,000 invoices, and sustainability gains (69% CO2, 69% energy, 63% water reduction vs. manual processing), plus benchmarks across LLM backends.
Significance. If the performance and sustainability claims can be substantiated with statistical rigor and workload representativeness, the work would offer a concrete case study of hybrid AI+HITL systems for high-volume document automation, with practical value in the reported production metrics, multi-LLM comparisons, and environmental impact analysis.
major comments (2)
- [Ablation evaluation (abstract and results section)] The claim that the stratified 100-document subset supports 98.5% accuracy and generalizes the 3% fallback rate to the full 100,000-invoice annual workload is load-bearing, yet no evidence is given that the 20 supplier/document-type categories match the empirical distribution in the production logs or that the 955 documents were randomly sampled rather than curated; this directly weakens the 70% FTE reduction projection.
- [Production deployment and sustainability analysis (abstract)] The reported 97.0% automation rate, 98.5% accuracy, and 69%/69%/63% sustainability reductions are presented without error bars, statistical significance tests, or detailed measurement protocols for accuracy and resource calculations, making the central empirical claims difficult to assess for reliability.
minor comments (2)
- [Introduction and method description] The PFTFI approach is introduced as novel but would benefit from earlier and more explicit definition of its feedback inheritance mechanism and how it differs from standard prompt tuning.
- [Results and tables] Any tables or figures presenting ablation or production metrics should include confidence intervals or variance measures to allow proper interpretation of the 98.5% accuracy figure.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps strengthen the empirical presentation of our work. We address each major comment below and have made revisions to incorporate additional details on sampling, representativeness, statistical measures, and measurement protocols.
Point-by-point responses
- Referee: Ablation evaluation (abstract and results section): The claim that the stratified 100-document subset supports 98.5% accuracy and generalizes the 3% fallback rate to the full 100,000-invoice annual workload is load-bearing, yet no evidence is given that the 20 supplier/document-type categories match the empirical distribution in the production logs or that the 955 documents were randomly sampled rather than curated; this directly weakens the 70% FTE reduction projection.
  Authors: We acknowledge the need for greater transparency on sampling and category selection. The 20 categories were derived from frequency analysis of production logs to capture the dominant supplier and document-type combinations observed in the operational data. The 955 documents constitute a consecutive production sample from January 2026 rather than a curated selection. We will revise the methods and results sections to explicitly describe this process, add a supplementary table summarizing category frequencies from the logs, and clarify that the FTE projection extrapolates the observed 97% automation rate from the full production deployment to the annual workload of 100,000 invoices. The ablation study provides supporting accuracy validation under the stratified design. We will also add a limitations paragraph noting that while the stratification targets representativeness, exact distributional matching across all possible categories would require further data collection.
  Revision: yes
- Referee: Production deployment and sustainability analysis (abstract): The reported 97.0% automation rate, 98.5% accuracy, and 69%/69%/63% sustainability reductions are presented without error bars, statistical significance tests, or detailed measurement protocols for accuracy and resource calculations, making the central empirical claims difficult to assess for reliability.
  Authors: We agree that including statistical rigor and detailed protocols will improve assessability. In the revised manuscript we will report binomial confidence intervals for the 97.0% automation rate and 98.5% accuracy figures. We will expand the methods section with explicit protocols: accuracy is measured via document-level agreement against human-validated ground truth, and sustainability metrics are computed using standard energy-consumption models for the deployed hardware together with established emission factors for both compute and manual labor baselines. We will also add a brief discussion of the measurement approach and any assumptions underlying the 69%/69%/63% reductions.
  Revision: yes
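The binomial confidence intervals promised in this response could be computed with a Wilson score interval, which is one common choice. The sketch below is an assumption about method, since the authors do not say which interval they will use; it treats the reported rates and sample sizes (97.0% of 955 documents, 98.5% of 100 documents) as simple binomial proportions.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    p_hat : observed proportion (e.g. 0.97)
    n     : number of trials (e.g. 955)
    z     : normal quantile; 1.96 gives an approximate 95% interval
    """
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    print("automation, n=955:", wilson_interval(0.97, 955))
    print("accuracy,   n=100:", wilson_interval(0.985, 100))
```

With these inputs, the interval around the 100-document accuracy figure is noticeably wider than the one around the 955-document automation rate, which illustrates the referee's concern about small-sample uncertainty.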
Circularity Check
No circularity: results are measured empirical outcomes from deployment and ablation
Full rationale
The paper reports measured automation rates (97% on 955 production documents), accuracy (98.5% on stratified 100-document ablation), and a projected 70% FTE reduction from operational analysis of the 100k-invoice workload. These are direct observations and comparisons rather than quantities derived from equations that reduce to fitted parameters, self-definitions, or load-bearing self-citations. The multi-agent architecture and PFTFI method are described independently of the evaluation metrics, with no derivation chain that collapses predictions back to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 20 supplier/document-type categories used for stratification are representative of the full 100,000-invoice annual workload.
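One way to probe this assumption is to reweight per-category accuracy from the equal-allocation sample by each category's share of production volume (post-stratification). The sketch below uses invented category names and numbers purely for illustration; the paper reports neither per-category accuracy nor production shares.

```python
def weighted_accuracy(accuracy: dict[str, float], share: dict[str, float]) -> float:
    """Accuracy reweighted by each category's share of production volume."""
    if set(accuracy) != set(share):
        raise ValueError("accuracy and share must cover the same categories")
    total_share = sum(share.values())
    # Normalise so the shares need not sum exactly to 1.
    return sum(accuracy[c] * share[c] for c in accuracy) / total_share


def unweighted_accuracy(accuracy: dict[str, float]) -> float:
    """Plain mean over categories, as implied by equal allocation."""
    return sum(accuracy.values()) / len(accuracy)


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper.
    acc = {"supplier_A/invoice": 1.00, "supplier_B/credit_note": 0.90}
    vol = {"supplier_A/invoice": 0.80, "supplier_B/credit_note": 0.20}
    print(unweighted_accuracy(acc), weighted_accuracy(acc, vol))
```

If the two figures diverge materially, the equal-allocation result should not be read as the expected accuracy on the annual workload.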
invented entities (1)
- PFTFI (Prompt Fine Tuning with Feedback Inheritance): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: MADP implements a specialized-module pipeline with five components... Classificator Agent uses a Convolutional Neural Network... Parser Agent... Docling library... Extraction Agent employs large language models... Validator Agent with PFTFI... sustainability analysis showing... 69% CO2 reduction
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Ablation study... Full MADP + HITL 98.5%... operational efficiency... 100,000 invoices per year
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.