Vidya: An AI-Driven Modular Pipeline for Archival Automation and Semantic Metadata Enrichment

Cloter Migliorini Filho; Edson Armando Silva; Julia Graciela Machado; Marcella Scoczynski

arxiv: 2605.16338 · v1 · pith:SQM5QB5Hnew · submitted 2026-05-07 · 💻 cs.DL · cs.CL

Pith reviewed 2026-05-20 23:40 UTC · model grok-4.3

keywords: archival metadata · LLM automation · semantic enrichment · digital archives · YAML ontologies · Pydantic validation · open source pipeline

The pith

The Vidya pipeline uses YAML ontologies and Pydantic validation to turn LLM outputs into deterministic archival metadata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Vidya, a modular pipeline designed to automate the creation of semantic metadata for large digitized historical archives. It tackles the issue of dark data by orchestrating large language models with free and open-source tools to enrich and ingest archival materials at scale. By defining ontologies in YAML and applying Pydantic validation, the system produces structured JSON outputs that meet standards such as NOBRADE and ISAD(G). This method enables low-cost implementation on modest hardware and reduces what would take decades of manual work to days.

Core claim

Vidya is an AI-driven modular pipeline that constrains generations from probabilistic large language models using YAML-defined ontologies and Pydantic validation, resulting in deterministic, structured JSON metadata compliant with archival standards.

What carries the argument

YAML-defined ontologies combined with Pydantic validation to constrain LLM outputs into reliable structured JSON.
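
A minimal sketch of what such a constraint layer could look like, assuming Pydantic is installed. The `ONTOLOGY` dict stands in for a parsed YAML file, and the field names, the vocabulary subset, and the helpers `build_model` and `validate_llm_output` are illustrative assumptions rather than the paper's actual schema or code.

```python
import json
from typing import Literal

from pydantic import ValidationError, create_model

# Hypothetical ontology, written as the dict a YAML loader would return.
# Field names and the vocabulary subset are illustrative, not from the paper.
ONTOLOGY = {
    "title": {"type": "str"},
    "level_of_description": {"enum": ["fonds", "series", "file", "item"]},
    "date_start_year": {"type": "int"},
}

_TYPES = {"str": str, "int": int}


def build_model(ontology):
    """Turn the ontology dict into a Pydantic model with required fields."""
    fields = {}
    for name, spec in ontology.items():
        if "enum" in spec:
            # A controlled vocabulary becomes a Literal of the allowed values.
            field_type = Literal[tuple(spec["enum"])]
        else:
            field_type = _TYPES[spec["type"]]
        fields[name] = (field_type, ...)  # "..." marks the field as required
    return create_model("ArchivalRecord", **fields)


ArchivalRecord = build_model(ONTOLOGY)


def validate_llm_output(raw_json):
    """Return (record, None) on success or (None, error message) on failure."""
    try:
        return ArchivalRecord(**json.loads(raw_json)), None
    except (ValidationError, json.JSONDecodeError, TypeError) as exc:
        return None, str(exc)
```

In a sketch like this, a response that is malformed JSON, omits a field, or uses a term outside the vocabulary is rejected rather than ingested. What the validator cannot do is check whether an accepted value is true of the document.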

If this is right

Processing time for archival metadata drops from decades to days for large collections.
Metadata generation becomes deterministic and standards-compliant without full manual intervention.
Memory institutions can deploy the system at low cost using open-source tools and modest hardware.
Semantic enrichment allows better discovery and reuse of historical digital objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar constraint techniques might apply to other fields needing structured outputs from AI, such as legal document processing.
Integration with existing archival software could further streamline workflows.
Over time, the reliance on AI for metadata could shift the role of archivists toward oversight and quality control.

Load-bearing premise

LLM outputs constrained by YAML ontologies and Pydantic validation will produce semantically accurate and archivally appropriate metadata at scale without substantial human correction or systematic errors.

What would settle it

Running Vidya on a test set of archival documents and comparing the generated metadata against that produced by expert human archivists for accuracy and compliance.
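
One way such a comparison could be scored is per-field agreement with the expert records. The sketch below uses only the standard library; the `normalize` and `field_agreement` helpers and the two sample records are hypothetical and are not data from the paper. Exact match after light normalization is a deliberately strict choice, and free-text fields such as scope notes would likely need a softer measure.

```python
def normalize(value):
    """Lower-case and collapse whitespace so trivial differences are not errors."""
    if value is None:
        return None
    return " ".join(str(value).lower().split())


def field_agreement(generated, gold):
    """Return {field: fraction of records where generated matches gold}."""
    if len(generated) != len(gold) or not gold:
        raise ValueError("generated and gold must be non-empty and the same length")
    fields = sorted({name for record in gold for name in record})
    return {
        name: sum(
            normalize(g.get(name)) == normalize(e.get(name))
            for g, e in zip(generated, gold)
        ) / len(gold)
        for name in fields
    }


# Hypothetical records for illustration only.
gold = [
    {"title": "Livro de Atas", "date_start_year": 1923},
    {"title": "Livro de Registro", "date_start_year": 1887},
]
generated = [
    {"title": "livro de  atas", "date_start_year": 1923},
    {"title": "Livro de Registro", "date_start_year": 1878},
]

scores = field_agreement(generated, gold)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```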

Original abstract

The large-scale digitization of historical archives has created a paradox: "dark data" (digital objects lacking metadata for retrieval). Manual archival description is slow and expensive, limiting discovery and reuse. We propose Vidya, a modular pipeline that orchestrates Large Language Models (LLMs) and FOSS tools to automate semantic enrichment and archival ingestion at scale. Vidya constrains generations using YAML-defined ontologies and Pydantic validation, producing deterministic, structured JSON outputs from probabilistic models. Developed at the Laboratory for Digital Humanities and Innovation (LAMUHDI) of the State University of Ponta Grossa (UEPG), Vidya applies Maker principles and open-source practices to enable low-cost deployment in memory institutions using modest hardware. We compare LLM performance and present a cost-benefit analysis showing major gains, reducing processing time from decades to days while complying with NOBRADE and ISAD(G).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Vidya describes a modular LLM pipeline with YAML and Pydantic constraints for archival metadata but supplies no numbers or tests to support its time-saving and accuracy claims.

Vidya is a proposed pipeline that chains LLMs with YAML ontologies and Pydantic validation to turn digitized archive material into structured, standards-compliant metadata. The authors position it as a low-cost, open-source option for institutions that cannot run large manual projects. That framing is the main takeaway: it takes existing LLM orchestration techniques and applies them to the specific constraints of archival description under NOBRADE and ISAD(G).

Referee Report

2 major / 1 minor

Summary. The paper introduces Vidya, a modular pipeline that combines LLMs with FOSS tools to automate semantic metadata enrichment and archival ingestion for digitized historical collections. It emphasizes the use of YAML-defined ontologies and Pydantic validation to constrain LLM outputs into deterministic, structured JSON that complies with NOBRADE and ISAD(G) standards, while claiming substantial reductions in processing time (from decades to days) and low-cost deployment on modest hardware.

Significance. If the quantitative claims and semantic accuracy hold, the work would address a practical bottleneck in digital archives by enabling scalable, low-resource metadata creation. The open-source, modular design and focus on memory institutions could have applied value for under-resourced collections, though the absence of supporting metrics limits assessment of real-world impact.

major comments (2)

[Abstract] The central claims of 'major gains' and reduction of processing time 'from decades to days' while achieving compliance with NOBRADE and ISAD(G) are presented without any quantitative results, accuracy metrics, error rates, or details on how the LLM performance comparisons and cost-benefit analysis were conducted. This absence is load-bearing for the automation-at-scale claim.
[Abstract] The description of YAML-defined ontologies and Pydantic validation is presented as sufficient to produce 'semantically accurate' and archivally appropriate metadata from probabilistic LLMs. However, these mechanisms enforce syntactic structure and vocabulary but provide no evidence (e.g., ablation studies, human gold-standard comparisons, or failure-mode analysis on ambiguous documents) that they close the semantic gap or prevent systematic misinterpretation.

minor comments (1)

[Abstract] The abstract refers to 'We compare LLM performance' without specifying which models, baselines, or evaluation criteria are used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the presentation of our quantitative claims and the validation of semantic accuracy in Vidya. We address each major comment below and outline revisions to the manuscript.

Referee: [Abstract] The central claims of 'major gains' and reduction of processing time 'from decades to days' while achieving compliance with NOBRADE and ISAD(G) are presented without any quantitative results, accuracy metrics, error rates, or details on how the LLM performance comparisons and cost-benefit analysis were conducted. This absence is load-bearing for the automation-at-scale claim.

Authors: We agree that the abstract would be strengthened by including specific quantitative metrics. The full manuscript includes a Results section with LLM performance comparisons (accuracy, error rates, and compliance with standards) and a cost-benefit analysis based on pilot processing of archival documents, which quantifies the observed reductions in manual effort. The comparisons involved running the pipeline on a test set of digitized historical records and measuring outputs against manual baselines. We will revise the abstract to summarize key figures from these analyses, such as processing time reductions and compliance rates, while adding a brief reference to the experimental setup.

Revision: yes

Referee: [Abstract] The description of YAML-defined ontologies and Pydantic validation is presented as sufficient to produce 'semantically accurate' and archivally appropriate metadata from probabilistic LLMs. However, these mechanisms enforce syntactic structure and vocabulary but provide no evidence (e.g., ablation studies, human gold-standard comparisons, or failure-mode analysis on ambiguous documents) that they close the semantic gap or prevent systematic misinterpretation.

Authors: We acknowledge that YAML ontologies and Pydantic validation primarily enforce syntactic structure, vocabulary adherence, and JSON schema compliance, with semantic guidance provided through ontology-informed prompting. The manuscript includes illustrative examples of generated metadata aligned with NOBRADE and ISAD(G). However, we agree that additional empirical support would better demonstrate mitigation of semantic errors. We will add a validation subsection with a human gold-standard comparison on a sample of documents and a discussion of observed failure modes on ambiguous cases, to provide evidence on the pipeline's effectiveness.

Revision: yes
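
The gap between structural validity and semantic accuracy discussed in this exchange can be made concrete with a small example. The sketch below uses only the standard library as a stand-in for a Pydantic schema; the record, the source sentence, and both helper functions are hypothetical. It shows a record that passes every structural check while carrying a transposed year, and one cheap grounding check that would catch this particular error.

```python
import re

ALLOWED_LEVELS = {"fonds", "series", "file", "item"}  # illustrative subset


def is_schema_valid(record):
    """Structural check only: required keys, types, and controlled vocabulary."""
    return (
        isinstance(record.get("title"), str)
        and isinstance(record.get("date_start_year"), int)
        and record.get("level_of_description") in ALLOWED_LEVELS
    )


def year_is_grounded(record, source_text):
    """Spot check: does the extracted year literally occur in the source text?"""
    years_in_text = {int(y) for y in re.findall(r"\b\d{4}\b", source_text)}
    return record["date_start_year"] in years_in_text


# Hypothetical source sentence and model output (1923 transposed to 1932).
source_text = "Minutes of the session held on 12 March 1923."
record = {
    "title": "Session minutes",
    "date_start_year": 1932,
    "level_of_description": "item",
}

print("schema valid:", is_schema_valid(record))
print("year grounded in source:", year_is_grounded(record, source_text))
```

A literal-match check of this kind only covers values that appear verbatim in the source. Inferred fields, such as subject terms or provenance, would still need the human gold-standard comparison the authors propose.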

Circularity Check

0 steps flagged

No significant circularity; descriptive system architecture with no derivations or self-referential reductions

The paper describes an engineering pipeline (Vidya) that applies YAML-defined ontologies and Pydantic validation to constrain LLM outputs for archival metadata. No equations, fitted parameters, uniqueness theorems, or mathematical derivations are present in the provided text. Claims about producing deterministic JSON, reducing processing time, and complying with NOBRADE/ISAD(G) are presented as direct consequences of the modular architecture and open-source tools rather than reductions to prior self-citations or inputs by construction. The work is self-contained as a practical implementation report without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that LLM generations can be made deterministic and archivally valid through external schema constraints alone.

axioms (1)

Domain assumption: LLM outputs can be reliably constrained to produce accurate semantic metadata using YAML-defined ontologies and Pydantic validation.
Invoked in the pipeline description to achieve deterministic structured outputs from probabilistic models.

pith-pipeline@v0.9.0 · 5688 in / 1276 out tokens · 35518 ms · 2026-05-20T23:40:31.294228+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation to the paper passage: unclear

Paper passage: Vidya constrains generations using YAML-defined ontologies and Pydantic validation, producing deterministic, structured JSON outputs from probabilistic models... complying with NOBRADE and ISAD(G).

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

Paper passage: Vidya employs a “digital straitjacket” approach, using YAML-defined ontologies and Pydantic validation to enforce deterministic, structured JSON outputs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
