MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

Diego Gosmar; Giovanni Zenezini

arxiv: 2605.17159 · v1 · pith:774MOTUSnew · submitted 2026-05-16 · 💻 cs.AI · cs.MA

Pith reviewed 2026-05-20 14:20 UTC · model grok-4.3

keywords: multi-agent systems, document processing, human-in-the-loop, sustainability, invoice automation, large language models, AI agents, enterprise AI

The pith

Multi-agent AI pipeline with selective human validation automates 97% of document processing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce MADP, a pipeline that deploys five AI agents to classify, split, parse, extract from, and validate documents, calling on humans only for difficult cases. In a production deployment on 955 real-world documents, the pipeline handled 97% of them end to end without non-AI fallback, and the full configuration with human supervision reached 98.5% document-level accuracy on a stratified 100-document subset. For a scenario of 100,000 invoices per year, the approach is projected to cut full-time-equivalent staffing by about 70% while lowering CO2 emissions and energy use by 69% and water use by 63% relative to manual processing. A reader would care because it demonstrates a working hybrid system that scales enterprise document handling without the cost of fully manual work or the risks of full automation.

Core claim

MADP integrates Classificator, Splitter, Parser, Extraction, and Validator agents with a Human-in-the-Loop mechanism and Prompt Fine Tuning with Feedback Inheritance to process documents. Production use on 955 documents yielded a 97.0% automation rate with 3% non-AI fallback, while ablation on a stratified 100-document subset showed 98.5% document-level accuracy for the full configuration. For 100,000 invoices per year, the authors project an approximately 70% FTE reduction and savings of 69% in CO2, 69% in energy, and 63% in water relative to manual processing.

What carries the argument

The five-agent architecture (Classificator, Splitter, Parser, Extraction, Validator) combined with Human-in-the-Loop supervision and the Prompt Fine Tuning with Feedback Inheritance (PFTFI) method for refining LLM prompts based on feedback.
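To make the control flow concrete, here is a minimal Python sketch of a five-stage pipeline with a confidence-gated human fallback. Everything in it is an assumption made for illustration: the `Document` type, the stub stage bodies, the `confidence` field, and the 0.9 threshold are hypothetical and are not taken from the paper. The source mentions a CNN classifier, the Docling library, and LLM extraction; the stubs below only stand in for those components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container passed between stages (not the paper's schema)."""
    raw: str
    supplier: str = ""
    pages: List[str] = field(default_factory=list)
    parsed: str = ""
    fields: Dict[str, str] = field(default_factory=dict)
    confidence: float = 0.0
    needs_human: bool = False


def classify(doc: Document) -> Document:
    # Stand-in for the Classificator Agent: a keyword match, not a CNN.
    doc.supplier = "acme" if "ACME" in doc.raw else "unknown"
    return doc


def split(doc: Document) -> Document:
    # Stand-in for the Splitter: form feed is an assumed page separator.
    doc.pages = doc.raw.split("\f")
    return doc


def parse(doc: Document) -> Document:
    # Stand-in for the Parser: flatten pages into one text stream.
    doc.parsed = " ".join(page.strip() for page in doc.pages)
    return doc


def extract(doc: Document) -> Document:
    # Stand-in for LLM extraction: read the token that follows "TOTAL".
    tokens = doc.parsed.split()
    if "TOTAL" in tokens and tokens.index("TOTAL") + 1 < len(tokens):
        doc.fields["total"] = tokens[tokens.index("TOTAL") + 1]
        doc.confidence = 0.95 if doc.supplier != "unknown" else 0.5
    return doc


def validate(doc: Document, threshold: float = 0.9) -> Document:
    # Route to a human when a required field is missing or confidence is low.
    doc.needs_human = "total" not in doc.fields or doc.confidence < threshold
    return doc


STAGES: List[Callable[[Document], Document]] = [classify, split, parse, extract, validate]


def run_pipeline(raw: str) -> Document:
    """Run every stage in order and return the annotated document."""
    doc = Document(raw=raw)
    for stage in STAGES:
        doc = stage(doc)
    return doc


def automation_rate(docs: List[Document]) -> float:
    """Share of documents that finished without human review."""
    return sum(not d.needs_human for d in docs) / len(docs)
```

In this sketch the automation rate is simply the share of documents whose `needs_human` flag stays false, which is one plausible reading of the "full-pipeline automation rate" the paper reports.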

If this is right

70% reduction in full-time equivalent requirements for annual processing of 100,000 invoices.
69% lower CO2 emissions, 69% lower energy consumption, and 63% lower water usage compared to traditional manual processing.
97.0% full-pipeline automation rate achieved in production deployment on 955 real-world documents.
98.5% document-level accuracy with the full MADP configuration including human supervision.
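The arithmetic behind these projections can be checked with a short sketch. It rests on two assumptions that the source does not confirm: that the 3% fallback rate observed on 955 documents carries over unchanged to 100,000 invoices, and that the 7 FTEs and 5.4 tons of CO2 per year in the Figure 5 caption are the hybrid-scenario values to which the 70% and 69% reductions apply. The manual baselines it prints are back-calculated from those figures, not reported by the authors, and the "approximately 70%" rounding makes them rough.

```python
# Figures quoted in the review; see the lead-in for the assumptions.
ANNUAL_INVOICES = 100_000   # scenario size stated in the paper
AUTOMATION_RATE = 0.97      # observed on 955 production documents
FTE_REDUCTION = 0.70        # projected reduction vs. manual processing
CO2_REDUCTION = 0.69        # projected reduction vs. manual processing
HYBRID_FTE = 7              # Figure 5 caption, AI+HITL scenario
HYBRID_CO2_TONS = 5.4       # Figure 5 caption, AI+HITL scenario


def fallback_volume(invoices: int, automation_rate: float) -> int:
    """Invoices per year expected to need non-AI handling."""
    return round(invoices * (1 - automation_rate))


def implied_baseline(value_after: float, reduction: float) -> float:
    """Back-calculate the pre-reduction value from a fractional reduction."""
    return value_after / (1 - reduction)


if __name__ == "__main__":
    print("fallback invoices/year:", fallback_volume(ANNUAL_INVOICES, AUTOMATION_RATE))
    print("implied manual FTEs:", round(implied_baseline(HYBRID_FTE, FTE_REDUCTION), 1))
    print("implied manual CO2 t/yr:", round(implied_baseline(HYBRID_CO2_TONS, CO2_REDUCTION), 1))
```

Even at a 97% automation rate, the fallback path still has to absorb a few thousand invoices per year in this scenario, which is why the stability of the 3% figure matters for the staffing projection.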

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar multi-agent setups with human oversight could apply to other document-heavy domains like legal contracts or medical records.
Over time, the feedback inheritance mechanism may allow the system to reduce human involvement even further.
The reported sustainability gains might encourage companies to prioritize hybrid AI systems for both cost and environmental reasons.

Load-bearing premise

The 100-document stratified ablation set represents the full range of the 100,000 annual invoices and human interventions stay at 3% without causing operational delays.

What would settle it

Processing an additional set of 1,000 invoices from new suppliers and measuring the actual percentage requiring human review, the resulting accuracy, and the real-world time and resource savings.

Figures

Figures reproduced from arXiv: 2605.17159 by Diego Gosmar, Giovanni Zenezini.

**Figure 1.** MADP pipeline: five sequential components (Classificator, Splitter, Parser, Extraction, Validator) with PFTFI feedback loop.
Excerpt following the figure, from 3.2 Classificator Agent: The Classificator Agent uses a Convolutional Neural Network trained to identify document types and supplier categories. For invoice processing, the CNN classifies documents by supplier, enabling supplier-specific extraction templates. The CNN architecture uses…

**Figure 2.** PFTFI mechanism: human corrections update extraction prompts and propagate to pending documents without model retraining.
Excerpt following the figure, from 3.7 Human Validation Interface: The validation interface presents each document side by side with the corresponding extracted metadata in structured JSON format, enabling efficient human inspection and correction. The design highlights fields with low confidence scores, provides compact…

**Figure 3.** Ablation study: incremental accuracy gains per component. Parser Agent contributes the largest improvement (+17.5 pp).

**Figure 4.** LLM backend comparison: Mistral-Small-3.2 leads in F1 (92.9%) and recall (98.2%); DeepSeek-OCR leads in precision (96.8%).
Excerpt following the figure, from 4.5 Operational Efficiency Analysis: To assess the scalability and economic viability of MADP, we project operational metrics for a representative enterprise use-case scenario of 100,000 invoices per year, based on the observed performance characteristics from the 955-document productio…

**Figure 5.** Three-scenario comparison: AI+HITL achieves best accuracy (98.5%), lowest CO2 (5.4 tons/year), with 7 FTEs.
Excerpt following the figure, from 5 Sustainability Impact Analysis: We present the first comprehensive sustainability analysis of AI-assisted document processing, quantifying environmental metrics across three scenarios and applying the use-case scenario of 100,000 invoices per year to quantify environmental impact at enterprise scale…
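
The Figure 4 caption gives F1 and recall for Mistral-Small-3.2 but not its precision. If, as an assumption, the reported F1 is the harmonic mean of a precision and recall measured on the same set and aggregated the same way, the missing value can be back-calculated. This is an inference for orientation, not a number from the paper, and it would not hold if the scores are macro-averaged across fields or suppliers.

```latex
F_1 = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
P = \frac{F_1 R}{2R - F_1}
  = \frac{0.929 \times 0.982}{2 \times 0.982 - 0.929}
  \approx 0.881
```

Under that assumption the implied precision of roughly 88% sits below the 96.8% quoted for DeepSeek-OCR, which is consistent with the caption's statement that DeepSeek-OCR leads in precision.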

Original abstract

Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor-intensive and error-prone. We present MADP, a multi-agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning-based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation. Our system integrates five specialized agents--Classificator, Splitter, Parser, Extraction, and Validator--with a Human-in-the-Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%. Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate, with only 3% requiring non-AI fallback. Ablation evaluation on a stratified 100-document subset (5 documents per each of 20 supplier/document-type categories) demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) provide practical insights for deployment in production environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MADP shows a working five-agent pipeline for invoice automation with real production numbers at 97% full automation, but the 100-document stratified ablation needs more scrutiny on whether it represents the full workload.

Hi, the main thing to know is that this paper describes a concrete multi-agent system for high-volume invoice processing that combines classification, splitting, parsing, LLM extraction, and validation with human oversight, plus a feedback-based prompt tuning method they call PFTFI. In a production run on 955 documents it reaches 97% full-pipeline automation with only 3% needing non-AI fallback, and an ablation on 100 stratified documents hits 98.5% accuracy. The authors also project a 70% drop in full-time equivalents and sizable cuts in CO2, energy, and water use compared with manual work, and they benchmark a few LLM backends for practical guidance.

That is the useful part: actual deployment metrics and an ablation that tests the full configuration rather than just isolated components. The sustainability numbers and the direct LLM comparisons give it some applied value for anyone running similar enterprise workflows.

The soft spot is the evaluation design. The ablation set takes five documents from each of twenty supplier and type categories, but there is no clear evidence in the abstract that those categories match the actual distribution or edge cases in the 100,000-invoice annual load. If the 955 production documents were not randomly sampled or if the ablation under-represents difficult cases, both the automation rate and the labor-reduction claim could look different at scale. The abstract also skips error bars, statistical tests, and the exact measurement protocols for accuracy and the environmental calculations, so those details will determine how much weight the numbers can carry.

This is aimed at practitioners building document-automation pipelines rather than theorists. It has enough real deployment data and a clear architecture to deserve a serious referee, though the review will probably focus on sampling, measurement transparency, and whether the ablation generalizes. I would send it for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MADP, a multi-agent pipeline for enterprise document processing that combines deep learning-based classification and parsing with LLM-based extraction, five specialized agents (Classificator, Splitter, Parser, Extraction, Validator), a Human-in-the-Loop mechanism, and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. It reports empirical results from production deployment on 955 real-world documents achieving 97.0% full-pipeline automation (3% non-AI fallback), an ablation study on a stratified 100-document subset (5 documents from each of 20 supplier/document-type categories) attaining 98.5% document-level accuracy with HITL, a projected ~70% FTE reduction for an annual workload of 100,000 invoices, and sustainability gains (69% CO2, 69% energy, 63% water reduction vs. manual processing), plus benchmarks across LLM backends.

Significance. If the performance and sustainability claims can be substantiated with statistical rigor and workload representativeness, the work would offer a concrete case study of hybrid AI+HITL systems for high-volume document automation, with practical value in the reported production metrics, multi-LLM comparisons, and environmental impact analysis.

major comments (2)

[Ablation evaluation (abstract and results section)] The claim that the stratified 100-document subset supports 98.5% accuracy and generalizes the 3% fallback rate to the full 100,000-invoice annual workload is load-bearing, yet no evidence is given that the 20 supplier/document-type categories match the empirical distribution in the production logs or that the 955 documents were randomly sampled rather than curated; this directly weakens the 70% FTE reduction projection.
[Production deployment and sustainability analysis (abstract)] The reported 97.0% automation rate, 98.5% accuracy, and 69%/69%/63% sustainability reductions are presented without error bars, statistical significance tests, or detailed measurement protocols for accuracy and resource calculations, making the central empirical claims difficult to assess for reliability.
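
One standard way to probe the representativeness concern in the first comment is post-stratification: reweight each category's ablation accuracy by that category's share of the production workload. The sketch below illustrates the idea under stated assumptions. The category names, counts, and shares are invented for the example, and the paper as summarized here does not report per-category results or say that it uses this technique.

```python
from typing import Dict, Tuple

Results = Dict[str, Tuple[int, int]]  # category -> (correct, evaluated)


def unweighted_accuracy(results: Results) -> float:
    """Pooled accuracy, which treats every sampled document equally."""
    correct = sum(c for c, _ in results.values())
    evaluated = sum(n for _, n in results.values())
    return correct / evaluated


def poststratified_accuracy(results: Results, production_share: Dict[str, float]) -> float:
    """Weight each category's accuracy by its share of production volume.

    production_share maps category -> fraction of the real workload and
    must cover the same categories as results and sum to 1.
    """
    if set(results) != set(production_share):
        raise ValueError("categories differ between results and shares")
    if abs(sum(production_share.values()) - 1.0) > 1e-9:
        raise ValueError("production shares must sum to 1")
    return sum(
        production_share[cat] * correct / evaluated
        for cat, (correct, evaluated) in results.items()
    )


if __name__ == "__main__":
    # Invented numbers: three categories with five sampled documents each.
    results = {"supplier_a": (5, 5), "supplier_b": (5, 5), "supplier_c": (3, 5)}
    shares = {"supplier_a": 0.2, "supplier_b": 0.2, "supplier_c": 0.6}
    print("pooled:", unweighted_accuracy(results))
    print("post-stratified:", poststratified_accuracy(results, shares))
```

With equal-sized strata, a weak category that dominates real volume pulls the post-stratified figure below the pooled one, which is the kind of gap the referee is asking the authors to rule out.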

minor comments (2)

[Introduction and method description] The PFTFI approach is introduced as novel but would benefit from earlier and more explicit definition of its feedback inheritance mechanism and how it differs from standard prompt tuning.
[Results and tables] Any tables or figures presenting ablation or production metrics should include confidence intervals or variance measures to allow proper interpretation of the 98.5% accuracy figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the empirical presentation of our work. We address each major comment below and have made revisions to incorporate additional details on sampling, representativeness, statistical measures, and measurement protocols.

Referee: Ablation evaluation (abstract and results section): The claim that the stratified 100-document subset supports 98.5% accuracy and generalizes the 3% fallback rate to the full 100,000-invoice annual workload is load-bearing, yet no evidence is given that the 20 supplier/document-type categories match the empirical distribution in the production logs or that the 955 documents were randomly sampled rather than curated; this directly weakens the 70% FTE reduction projection.

Authors: We acknowledge the need for greater transparency on sampling and category selection. The 20 categories were derived from frequency analysis of production logs to capture the dominant supplier and document-type combinations observed in the operational data. The 955 documents constitute a consecutive production sample from January 2026 rather than a curated selection. We will revise the methods and results sections to explicitly describe this process, add a supplementary table summarizing category frequencies from the logs, and clarify that the FTE projection extrapolates the observed 97% automation rate from the full production deployment to the annual workload of 100,000 invoices. The ablation study provides supporting accuracy validation under the stratified design. We will also add a limitations paragraph noting that while the stratification targets representativeness, exact distributional matching across all possible categories would require further data collection.
revision: yes

Referee: Production deployment and sustainability analysis (abstract): The reported 97.0% automation rate, 98.5% accuracy, and 69%/69%/63% sustainability reductions are presented without error bars, statistical significance tests, or detailed measurement protocols for accuracy and resource calculations, making the central empirical claims difficult to assess for reliability.

Authors: We agree that including statistical rigor and detailed protocols will improve assessability. In the revised manuscript we will report binomial confidence intervals for the 97.0% automation rate and 98.5% accuracy figures. We will expand the methods section with explicit protocols: accuracy is measured via document-level agreement against human-validated ground truth, and sustainability metrics are computed using standard energy-consumption models for the deployed hardware together with established emission factors for both compute and manual labor baselines. We will also add a brief discussion of the measurement approach and any assumptions underlying the 69%/69%/63% reductions.
revision: yes
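
The rebuttal promises binomial confidence intervals without naming a method. As a sketch of what such an interval could look like, the code below uses the Wilson score interval, which is this example's choice and not something the authors specify. It plugs in the reported proportions directly. Note that 98.5% of 100 is not a whole number of documents, so that figure may be an average rather than a single count, and the interval for it should be read as an approximation.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    p_hat is the observed proportion, n the sample size, and z the normal
    quantile (1.96 corresponds to roughly 95% coverage).
    """
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for label, p_hat, n in [
        ("automation rate, 955 production documents", 0.97, 955),
        ("accuracy, 100-document ablation subset", 0.985, 100),
    ]:
        low, high = wilson_interval(p_hat, n)
        print(f"{label}: {low:.3f} to {high:.3f}")
```

Because the ablation subset is about a tenth the size of the production sample, its interval is noticeably wider, which supports the referee's point that the 98.5% figure needs a stated uncertainty.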

Circularity Check

0 steps flagged

No circularity: results are measured empirical outcomes from deployment and ablation

The paper reports measured automation rates (97% on 955 production documents), accuracy (98.5% on stratified 100-document ablation), and a projected 70% FTE reduction from operational analysis of the 100k-invoice workload. These are direct observations and comparisons rather than quantities derived from equations that reduce to fitted parameters, self-definitions, or load-bearing self-citations. The multi-agent architecture and PFTFI method are described independently of the evaluation metrics, with no derivation chain that collapses predictions back to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central performance claims rest on the assumption that the chosen 20-category stratification captures production variability and that LLM backends behave consistently across the tested suppliers. No explicit free parameters or new physical entities are introduced.

axioms (1)

domain assumption: The 20 supplier/document-type categories used for stratification are representative of the full 100,000-invoice annual workload.
The ablation study design in the abstract selects 5 documents per category.

invented entities (1)

PFTFI (Prompt Fine Tuning with Feedback Inheritance): no independent evidence
purpose: Mechanism to incorporate human corrections into future prompts for improved extraction.
Presented as a novel component of the MADP pipeline.
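
The source describes PFTFI only at the level of the Figure 2 caption: human corrections update extraction prompts and propagate to pending documents without model retraining. A minimal sketch of one mechanism that would fit that description follows. The `PromptStore` and `PendingDoc` classes, the per-supplier hint lists, and the version counter are all hypothetical names introduced here, not the authors' design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PendingDoc:
    """A queued document and the prompt version it will be processed with."""
    doc_id: str
    supplier: str
    prompt_version: int = 0


@dataclass
class PromptStore:
    """Hypothetical per-supplier prompt registry with accumulated hints."""
    base_prompt: str
    hints: Dict[str, List[str]] = field(default_factory=dict)

    def version(self, supplier: str) -> int:
        # The version is the number of corrections recorded for the supplier.
        return len(self.hints.get(supplier, []))

    def render(self, supplier: str) -> str:
        # Base prompt followed by one bullet per inherited correction.
        lines = [self.base_prompt] + [f"- {h}" for h in self.hints.get(supplier, [])]
        return "\n".join(lines)

    def apply_correction(self, supplier: str, hint: str, queue: List[PendingDoc]) -> int:
        """Record a human correction and mark affected pending documents.

        Returns how many queued documents inherit the updated prompt.
        No model weights change; only the prompt text does.
        """
        self.hints.setdefault(supplier, []).append(hint)
        new_version = self.version(supplier)
        inherited = 0
        for doc in queue:
            if doc.supplier == supplier and doc.prompt_version < new_version:
                doc.prompt_version = new_version
                inherited += 1
        return inherited
```

In this reading, "inheritance" means that documents still waiting in the queue for the same supplier pick up the corrected prompt automatically, while documents from other suppliers are left untouched.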

pith-pipeline@v0.9.0 · 5815 in / 1453 out tokens · 45453 ms · 2026-05-20T14:20:50.285861+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry gives the Lean file, the theorem name, and a tag for the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: MADP implements a specialized-module pipeline with five components... Classificator Agent uses a Convolutional Neural Network... Parser Agent... Docling library... Extraction Agent employs large language models... Validator Agent with PFTFI... sustainability analysis showing... 69% CO2 reduction

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Ablation study... Full MADP + HITL 98.5%... operational efficiency... 100,000 invoices per year

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Open voice interoperability specifications

David Attwater, Emmett Coin, Deborah Dahl, Leah Barnes, Allan Wylie, and Diego Gosmar. Open voice interoperability specifications. https://github.com/ open-voice-interoperability/docs/tree/main/specifications, 2024

work page 2024

[2] [2]

Improving OCR using internal document redundancy

Diego Belzarena et al. Improving OCR using internal document redundancy. https://arxiv.org/abs/2508.14557, 2025

work page arXiv 2025

[3] [3]

AI auditing: The broken bus on the road to AI accountability

Abeba Birhane, Ryan Steed, Victor Ojewale, Briana Vecchione, and Inioluwa Deb- orah Raji. AI auditing: The broken bus on the road to AI accountability. https://arxiv.org/abs/2401.14462, 2024

work page arXiv 2024

[4] [4]

Language Models are Few-Shot Learners

Tom B. Brown et al. Language models are few-shot learners. https://arxiv.org/ abs/2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005

[5] [5]

The carbon emissions of homework- ing and office working

Circular Ecology. The carbon emissions of homework- ing and office working. https://circularecology.com/news/ the-carbon-emissions-of-homeworking-and-office-working, 2023

work page 2023

[6] [6]

Diego Gosmar and Deborah A. Dahl. Hallucination mitigation using agentic AI natural language-based frameworks.https://arxiv.org/abs/2501.13946, 2025

work page arXiv 2025

[7] [7]

Diego Gosmar and Deborah A. Dahl. Sentinel agents for secure and trustworthy agentic AI in multi-agent systems.https://arxiv.org/abs/2509.14956, 2025

work page arXiv 2025

[8] [8]

Diego Gosmar and Deborah A. Dahl. Hallucination mitigation with agentic AI NLP- based open-floor standard. InProceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART), volume 5, pages 3893–3900. SciTePress, 2026.https://doi.org/10.5220/0013761000004052

work page doi:10.5220/0013761000004052 2026

[9] [9]

Diego Gosmar and Deborah A. Dahl. Prompt injection mitigation with agentic AI, nested learning, and AI sustainability via semantic caching. https://arxiv.org/ abs/2601.13186, 2026

work page arXiv 2026

[10] [10]

Dahl, and Dario Gosmar

Diego Gosmar, Deborah A. Dahl, and Dario Gosmar. Prompt injection detection and mitigation via AI multi-agent NLP frameworks. https://arxiv.org/abs/ 2503.11517, 2025

work page arXiv 2025

[11] [11]

Dahl, and Dario Gosmar

Diego Gosmar, Deborah A. Dahl, and Dario Gosmar. Prompt injection detection and mitigation with AI multiagent NLP-based agentic frameworks. In2025 3rd International Conference on Foundation and Large Language Models (FLLM), pages 923–930. IEEE, 2025.https://doi.org/10.1109/FLLM67465.2025.11391215. MADP: Multi-Agent Document Processing 17

work page doi:10.1109/fllm67465.2025.11391215 2025

[12] Greenly. Remote vs office: Which one is greener? https://greenly.earth/en-gb/blog/company-guide/remote-vs-office--which-one-is-greener, 2024

[13] Adam W. Harley et al. Evaluation of deep convolutional nets for document image classification and retrieval. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), pages 991–995, 2015. https://doi.org/10.1109/ICDAR.2015.7333910

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. https://arxiv.org/abs/1512.03385, 2015

[15] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. https://arxiv.org/abs/2308.00352, 2024

[16] IBM Research Zurich and Linux Foundation AI & Data. Docling: Advanced document parsing library. https://github.com/DS4SD/docling, 2024

[17] Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-free document understanding transformer. https://arxiv.org/abs/2111.15664, 2022

[18] Siddhant Kulkarni and Yukta Kulkarni. Benchmarking multi-agent LLM architectures for financial document processing: A comparative study of orchestration patterns, cost-accuracy tradeoffs and production scaling strategies. https://arxiv.org/abs/2603.22651, 2026

[19] Pengfei Li, Jianyi Yang, Mohammad A. Islam, and Shaolei Ren. Making AI less "thirsty": Uncovering and addressing the secret water footprint of AI models. https://arxiv.org/abs/2304.03271, 2025

[20] Jiale Liu et al. Memory-augmented agent training for business document understanding. https://arxiv.org/abs/2412.15274, 2024

[21] Yuliang Liu et al. OCRBench: On the hidden mystery of OCR in large multimodal models. https://arxiv.org/abs/2305.07895, 2023

[22] Alexandra Luccioni et al. Power hungry processing: Watts driving the cost of AI deployment? In ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 85–99, 2023. https://arxiv.org/abs/2311.16863

[23] Eduardo Mosqueira-Rey et al. Human-in-the-loop machine learning: A state of the art. Artificial Intelligence Review, 56(4):3005–3054, 2023. https://doi.org/10.1007/s10462-022-10246-w

[24] George Nagy. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):38–62, 2000. https://doi.org/10.1109/34.824820

[25] OSWBZ. Energy consumption in office buildings: Comparative analysis. http://oswbz.org/wp-content/uploads/2017/03/ENERGY-CONSUMPTION-IN-OFFICE-BUILDINGS.pdf, 2017

[26] Parseur. AI invoice processing benchmarks 2026: Accuracy, speed, and cost analysis. https://parseur.com/blog/ai-invoice-processing-benchmarks, 2025

[27] David Patterson et al. The carbon footprint of machine learning training will plateau, then shrink. IEEE Computer, 55(7):18–28, 2022. https://doi.org/10.1109/MC.2022.3148714

[28] Siddharth Samsi et al. From words to watts: Benchmarking the energy costs of large language model inference. In IEEE High Performance Extreme Computing Conference (HPEC), 2023. https://arxiv.org/abs/2310.03003

[29] Burr Settles. Active learning. https://doi.org/10.2200/S00429ED1V01Y201207AIM018, 2012. Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, pp. 1–114

[30] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. https://arxiv.org/abs/1906.02243, 2019

[31] Niels Tonsen. AI invoice processing: Less errors and faster payments. https://www.turian.ai/blog/ai-invoice-processing, 2025. Turian Blog, May 5, 2025

[32] Qingyun Wu et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. https://arxiv.org/abs/2308.08155, 2023

[33] Yiheng Xu et al. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 1192–1200, 2020. https://arxiv.org/abs/1912.13318

[34] Boyang Zhang et al. ParseBench: A document parsing benchmark for AI agents. https://arxiv.org/abs/2604.08538, 2026

[35] Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, and Wentao Zhang. Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction. https://arxiv.org/abs/2410.21169, 2026

[36] Yue Zhang et al. Siren’s song in the AI ocean: A survey on hallucination in large language models. https://arxiv.org/abs/2309.01219, 2025

[37] Zhiyuan Zhao et al. DocLayout-YOLO: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception. https://arxiv.org/abs/2410.12628, 2024