LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Joshua Castillo; Ravi Mukkamala

arxiv: 2604.06571 · v1 · submitted 2026-04-08 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG

keywords missing persons · data extraction · large language models · schema validation · information extraction · OCR · investigative documents

The pith

An LLM-guided parser turns scattered missing-person documents into reliable schema-compliant data with higher accuracy than rule-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Guardian Parser Pack, a pipeline that pulls information about missing persons from mixed PDFs, posters, and web profiles and converts it into one standard format. It combines multi-engine text extraction with OCR fallback, source-specific rule-based parsers, schema validation, and an optional LLM-assisted extraction step whose outputs are repaired under validator guidance if they fail the schema. On a manually aligned subset of 75 cases the LLM route reached an F1 of 0.8664 for extraction quality while the deterministic route scored 0.2578; across 517 records it also filled more key fields (96.97% versus 93.23%). The deterministic route stayed much faster (0.03 s versus 3.95 s per record), and every LLM output passed schema validation. The authors argue this shows probabilistic AI can be used safely inside an auditable, schema-first system for high-stakes investigative work.

Core claim

The Guardian Parser Pack converts heterogeneous missing-person documents into a unified schema-compliant representation, and its LLM-assisted extraction pathway delivers substantially higher extraction quality (F1 0.8664 versus 0.2578) and key-field completeness (96.97% versus 93.23%) than a deterministic comparator while keeping all outputs schema-valid.
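To make the headline metric concrete, here is a minimal sketch of field-level precision, recall, and F1 for one record. The matching rule (exact match after trimming and lowercasing), the field names, and the sample values are all assumptions for illustration; the paper's own alignment and scoring protocol is not described in the abstract.

```python
from typing import Dict, Set, Tuple

Record = Dict[str, str]


def _pairs(record: Record) -> Set[Tuple[str, str]]:
    """Normalize a record into comparable (field, value) pairs, skipping blanks."""
    return {(k, v.strip().lower()) for k, v in record.items() if v and v.strip()}


def field_prf(pred: Record, gold: Record) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over (field, value) pairs for one record."""
    pred_pairs, gold_pairs = _pairs(pred), _pairs(gold)
    true_pos = len(pred_pairs & gold_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: the predicted age is wrong, the other two fields match.
gold = {"name": "Jane Doe", "age": "14", "last_seen_city": "Norfolk"}
pred = {"name": "jane doe", "age": "15", "last_seen_city": "Norfolk"}
precision, recall, f1 = field_prf(pred, gold)
print(precision, recall, f1)
```

Under this scoring rule a wrong value costs both precision and recall, which is why a gold standard with systematic alignment errors would distort the reported gap in either direction.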

What carries the argument

The LLM-assisted extraction pathway with validator-guided repair, integrated into a multi-engine PDF extractor, rule-based source parsers, and schema-first harmonization.
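A validator-guided repair loop of the kind named here might look like the sketch below. The two-field schema, the error-message format, the round limit, and the stub `fake_extract` and `fake_repair` functions (which stand in for LLM calls) are hypothetical; the source does not show the authors' schema or prompts.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]

# Hypothetical two-field schema; the real Guardian schema is not shown in the source.
SCHEMA = {"name": str, "age": int}


def validate(record: Record) -> List[str]:
    """Return human-readable schema violations (an empty list means valid)."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


def extract_with_repair(
    text: str,
    extract: Callable[[str], Record],
    repair: Callable[[Record, List[str]], Record],
    max_rounds: int = 2,
) -> Tuple[Record, List[str], int]:
    """Extract once, then feed validator errors back to `repair` until valid."""
    record = extract(text)
    errors = validate(record)
    rounds = 0
    while errors and rounds < max_rounds:
        record = repair(record, errors)
        errors = validate(record)
        rounds += 1
    return record, errors, rounds


# Stubs standing in for LLM calls (assumptions, not the authors' prompts).
def fake_extract(text: str) -> Record:
    return {"name": "Jane Doe", "age": "14"}  # age has the wrong type


def fake_repair(record: Record, errors: List[str]) -> Record:
    fixed = dict(record)
    if any(e.startswith("age") for e in errors):
        fixed["age"] = int(fixed["age"])
    return fixed


record, errors, rounds = extract_with_repair(
    "MISSING: Jane Doe, 14", fake_extract, fake_repair
)
print(record, errors, rounds)
```

Returning the number of repair rounds alongside the record keeps the loop auditable: a run in which every output validates on the first pass, as the paper reports, would show zero rounds for every record.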

If this is right

Better extraction quality supports more accurate spatial modeling and search planning in missing-person cases.
Schema validation keeps outputs auditable even when an LLM is used.
The deterministic pathway can handle bulk initial processing while the LLM pathway refines difficult records.
Because every LLM output passed initial schema validation in the evaluated run, the repair step served as a built-in safeguard rather than a source of the observed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A hybrid system that routes only uncertain records to the LLM could keep most of the quality gain while reducing average runtime.
The same schema-guided approach could be tested on other domains that combine narrative reports with structured forms, such as legal or medical records.
Long-term operational value depends on whether the completeness gains translate into measurable improvements in real investigation outcomes.
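The hybrid-routing idea in the first extension above can be put in rough numbers. The sketch below assumes every record gets the deterministic pass and only a fraction is escalated to the LLM; the routing rule (escalate when a key field is empty) and the field names are hypothetical, and only the two mean runtimes come from the paper.

```python
from typing import Any, Dict, Iterable

DET_SECONDS = 0.03  # mean s/record, deterministic pathway (reported in the abstract)
LLM_SECONDS = 3.95  # mean s/record, LLM pathway (reported in the abstract)


def mean_runtime(llm_fraction: float) -> float:
    """Expected s/record when every record gets the deterministic pass
    and a fraction `llm_fraction` is additionally sent to the LLM."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be in [0, 1]")
    return DET_SECONDS + llm_fraction * LLM_SECONDS


def needs_llm(record: Dict[str, Any], key_fields: Iterable[str]) -> bool:
    """Hypothetical routing rule: escalate when any key field is empty."""
    return any(record.get(field) in (None, "") for field in key_fields)


KEY_FIELDS = ("name", "age", "last_seen_place")  # illustrative names
draft = {"name": "Jane Doe", "age": None, "last_seen_place": "Norfolk, VA"}
print(needs_llm(draft, KEY_FIELDS))
print(round(mean_runtime(0.2), 2))
```

Under these assumptions, escalating 20% of records gives 0.03 + 0.2 × 3.95 = 0.82 s/record on average. Whether the quality gain survives such routing is an open question the paper does not test.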

Load-bearing premise

The manually aligned gold standard correctly captures every piece of true information in the source documents, and schema-validated LLM outputs are reliable enough for operational use without extra human review.

What would settle it

A collection of documents where the LLM pathway returns schema-valid but factually wrong values that the gold standard does not contain, or where higher completeness scores fail to improve actual search-planning or triage decisions.

Figures

Figures reproduced from arXiv: 2604.06571 by Joshua Castillo, Ravi Mukkamala.

**Figure 2.** Core Parsing and Reasoning Pipeline (Extraction, Harmonization,

**Figure 3.** Example Missing Person Document-Page 1

**Figure 4.** Example Missing Person Document-Page 2. The draft record is then standardized so that the same concepts are represented in the same way across all sources (sanitization and harmonization). If the document provides only a human-readable place string, the system fills in missing coordinates from that string (geocoding). This matters because spatial modeling and mapping require numeric latitude/longitude, s…

**Figure 6.** Rule-based Parser Path Output for the Example Document

Original abstract

Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.
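Component (i) of the abstract, multi-engine text extraction with OCR fallback, can be sketched as an ordered chain of extractors. The engines below are stubs, and the engine names, the `min_chars` usability threshold, and the sample text are assumptions; the source does not say which PDF or OCR libraries the authors use or how they decide to fall back.

```python
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[bytes], str]


def extract_text(
    pdf_bytes: bytes,
    engines: List[Tuple[str, Extractor]],
    min_chars: int = 40,
) -> Tuple[Optional[str], str]:
    """Try engines in order; return (engine_name, text) for the first usable result."""
    for name, engine in engines:
        try:
            text = engine(pdf_bytes)
        except Exception:
            continue  # a failing engine should not abort the whole chain
        if len(text.strip()) >= min_chars:
            return name, text
    return None, ""  # nothing usable: caller can flag the document for review


# Stub engines (hypothetical): a scanned poster has no text layer, so OCR is needed.
def text_layer_engine(_: bytes) -> str:
    return ""


def ocr_engine(_: bytes) -> str:
    return "MISSING PERSON  Name: Jane Doe  Age: 14  Last seen: Norfolk, VA"


engine_used, text = extract_text(
    b"%PDF-", [("text_layer", text_layer_engine), ("ocr", ocr_engine)]
)
print(engine_used)
```

Recording which engine produced the text gives downstream validation and human reviewers a provenance trail, which fits the auditable design the abstract emphasizes.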

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a practical schema-first pipeline for extracting structured data from messy missing-person documents and shows solid F1 gains over a deterministic baseline, but the gold-standard alignment lacks enough detail to fully trust the comparison.

This paper describes the Guardian Parser Pack, a pipeline that combines PDF extraction, OCR, rule-based parsers, schema validation, and an optional LLM path to turn heterogeneous missing-person records into normalized fields. The main contribution is the end-to-end application to this narrow but important domain, with built-in auditability and validation steps that make sense for investigative work. They report clear numbers: on 75 manually aligned cases the LLM-assisted version reaches F1 0.8664 versus 0.2578 for the deterministic comparator, and across 517 records it lifts key-field completeness from 93.23% to 96.97% while noting the speed cost. The architecture choices and the fact that schema validation passed on all LLM outputs are useful details for anyone trying to control probabilistic components in high-stakes settings. The emphasis on shared geocoding and validator-guided repair also shows thoughtful engineering for operational use. The evaluation is empirical rather than circular, which is a plus.

The soft spot is the gold standard itself. The abstract says the 75 cases were manually aligned but gives no protocol, no inter-annotator numbers, and no discussion of how conflicts across document types were resolved. Since the deterministic path already struggles with layout and terminology variation, any systematic misalignment would hit the reported delta hardest. The completeness metric only tracks whether fields are present, not whether they are correct, so it is weaker evidence. These gaps are real but not fatal for an applied systems paper; they just mean the quantitative claims need more supporting description.

This work is for practitioners building tools for law enforcement or child-safety data workflows rather than for core NLP researchers. A reader who wants a concrete example of wrapping LLMs inside schema constraints for messy real documents will find the architecture and trade-off discussion worthwhile. It deserves peer review because the results are concrete, the problem is well-motivated, and the system is reproducible enough on the surface to let referees ask the right questions about the alignment process and error analysis.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Guardian Parser Pack, a schema-guided pipeline for extracting and normalizing intelligence from heterogeneous missing-person documents (structured forms, posters, web profiles). It combines multi-engine PDF/OCR extraction, rule-based source identification and parsers, schema-first harmonization/validation, and an optional LLM-assisted extraction pathway with validator-guided repair and geocoding. On a manually aligned 75-case subset the LLM pathway reports F1=0.8664 versus 0.2578 for the deterministic comparator; across 517 records per pathway it reports higher key-field completeness (96.97% vs. 93.23%) while the deterministic path remains faster (0.03 s vs. 3.95 s per record). All LLM outputs passed schema validation in the evaluated run.

Significance. If the gold-standard alignment is reliable, the work offers a concrete, auditable demonstration that LLM assistance can materially improve extraction quality from variable investigative documents while preserving schema compliance and traceability. The empirical, non-circular evaluation design and the emphasis on operational indicators (completeness, runtime) are strengths that could support controlled deployment in high-stakes settings.

major comments (2)

[Evaluation] Evaluation section (abstract and results): The protocol for manually aligning the 75-case gold standard is not described—no details on annotator qualifications, inter-annotator agreement, or resolution of ambiguities (e.g., conflicting ages/locations across poster vs. narrative). Because the headline F1 gap (0.8664 vs. 0.2578) rests entirely on this reference, the absence of these details is load-bearing for the central performance claim.
[Results] Results section: The 517-record completeness figures (96.97% vs. 93.23%) measure only field presence, not correctness against an external reference. This metric therefore provides weaker support for the claim of overall superiority than the 75-case F1 comparison.
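The second major comment turns on the difference between a field being present and a field being correct. The sketch below shows the two measures side by side; the emptiness test, the field names, and the sample records are assumptions, since the paper's exact completeness definition is not given here.

```python
from typing import Any, Dict, List, Sequence

Record = Dict[str, Any]
EMPTY = (None, "", [])


def key_field_completeness(records: List[Record], key_fields: Sequence[str]) -> float:
    """Share of (record, key field) slots that hold any non-empty value."""
    total = len(records) * len(key_fields)
    filled = sum(1 for r in records for f in key_fields if r.get(f) not in EMPTY)
    return filled / total if total else 0.0


def key_field_accuracy(
    records: List[Record], gold: List[Record], key_fields: Sequence[str]
) -> float:
    """Share of slots whose value equals the reference value."""
    total = len(records) * len(key_fields)
    correct = sum(
        1 for r, g in zip(records, gold) for f in key_fields if r.get(f) == g.get(f)
    )
    return correct / total if total else 0.0


# Made-up data: one age is filled in but wrong (41 instead of 14).
KEYS = ("name", "age")
gold = [{"name": "Jane Doe", "age": 14}, {"name": "John Roe", "age": 16}]
pred = [{"name": "Jane Doe", "age": 41}, {"name": "John Roe", "age": 16}]
completeness = key_field_completeness(pred, KEYS)
accuracy = key_field_accuracy(pred, gold, KEYS)
print(completeness, accuracy)
```

In this toy case completeness is 100% while accuracy is 75%, which is the gap the referee is pointing at: a pathway can fill more fields without filling them correctly.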

minor comments (1)

[Abstract] Abstract: No variance or distribution is reported for the runtime figures, which would help readers assess operational consistency.
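A runtime summary of the kind the minor comment asks for could be produced with the standard library alone. The per-record timings below are made up for illustration and are not the paper's measurements.

```python
import statistics
from typing import Dict, List


def summarize_runtimes(seconds: List[float]) -> Dict[str, float]:
    """Mean plus spread statistics for per-record runtimes (needs >= 2 samples)."""
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    cuts = statistics.quantiles(seconds, n=20, method="inclusive")
    return {
        "mean": statistics.mean(seconds),
        "stdev": statistics.stdev(seconds),
        "median": statistics.median(seconds),
        "p95": cuts[18],
        "max": max(seconds),
    }


# Hypothetical LLM-pathway timings in seconds, with one slow outlier.
llm_runtimes = [3.1, 3.4, 3.6, 3.9, 4.0, 4.2, 9.8]
summary = summarize_runtimes(llm_runtimes)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

A single slow record pulls the mean above the median here, which is the kind of skew a bare mean of 3.95 s/record cannot reveal.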

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate the changes we will incorporate in the revised version.

Referee: [Evaluation] Evaluation section (abstract and results): The protocol for manually aligning the 75-case gold standard is not described—no details on annotator qualifications, inter-annotator agreement, or resolution of ambiguities (e.g., conflicting ages/locations across poster vs. narrative). Because the headline F1 gap (0.8664 vs. 0.2578) rests entirely on this reference, the absence of these details is load-bearing for the central performance claim.

Authors: We agree that the manuscript lacks a sufficient description of the gold-standard alignment process, which is required to substantiate the F1 results. In the revised manuscript we will add a dedicated paragraph in the Evaluation section that describes the alignment protocol: the 75 cases were aligned by a single domain expert (one of the authors with prior experience in law-enforcement data curation) using a fixed template that maps source fields to the target schema. Ambiguities such as conflicting ages or locations were resolved by preferring the most recent official form over posters or web profiles. We will explicitly note that inter-annotator agreement was not computed because alignment was performed by a single annotator; this limitation will be stated. These additions will make the evaluation protocol transparent and address the referee’s concern about the load-bearing nature of the reference.
revision: yes

Referee: [Results] Results section: The 517-record completeness figures (96.97% vs. 93.23%) measure only field presence, not correctness against an external reference. This metric therefore provides weaker support for the claim of overall superiority than the 75-case F1 comparison.

Authors: We agree that the completeness figures reflect only field presence and not factual correctness. This metric is therefore weaker evidence of extraction quality than the F1 scores on the aligned subset. In the revised Results section we will explicitly qualify the completeness numbers as an operational indicator of coverage and schema compliance across the full corpus, while clarifying that they do not substitute for accuracy assessment. We will retain the metric because it demonstrates a practical operational benefit, but we will subordinate it to the F1 comparison and add a sentence acknowledging its limitations.
revision: partial
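The precedence rule stated in the first response (prefer the most recent official form over posters or web profiles) can be written as a sort key. The source-type labels, the `Observation` structure, and the choice to break ties among non-official sources by recency are assumptions added for illustration; the simulated rebuttal specifies only the preference for official forms.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Lower rank wins. Labels are hypothetical; unknown source types rank last.
SOURCE_RANK = {"official_form": 0, "poster": 1, "web_profile": 1}


@dataclass(frozen=True)
class Observation:
    source_type: str
    observed_on: date
    value: str


def resolve(observations: List[Observation]) -> Observation:
    """Pick the value from the best-ranked source, newest first within a rank."""
    if not observations:
        raise ValueError("no observations to resolve")
    return min(
        observations,
        key=lambda o: (SOURCE_RANK.get(o.source_type, 2), -o.observed_on.toordinal()),
    )


# Made-up conflicting ages for one case.
ages = [
    Observation("poster", date(2025, 3, 1), "15"),
    Observation("official_form", date(2025, 1, 10), "14"),
    Observation("official_form", date(2025, 2, 2), "14"),
    Observation("web_profile", date(2025, 4, 5), "16"),
]
chosen = resolve(ages)
print(chosen)
```

Encoding the rule this way makes the annotator's choice reproducible, though it does not replace a second annotator: a fixed rule applied by one person still cannot yield an agreement figure.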

Circularity Check

0 steps flagged

No circularity in empirical evaluation of extraction pipeline

The paper describes a schema-guided parsing system and reports direct empirical measurements: F1 scores on a 75-case manually aligned subset and key-field completeness across 517 records. These quantities are computed against external reference data and corpus aggregates rather than derived from any internal equations, fitted parameters, or self-referential definitions. No derivation chain, ansatz, uniqueness theorem, or self-citation load-bearing step is present that would reduce the claimed performance to quantities defined by the system itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the utility of a unified schema for harmonizing heterogeneous documents and the representativeness of the evaluated cases; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption: A single unified schema can adequately represent and validate data extracted from diverse missing-person document formats.
The pipeline is built around schema-first harmonization, validation, and LLM-guided repair.

invented entities (1)

Guardian Parser Pack (no independent evidence)
purpose: Integrated AI-driven parsing and normalization pipeline for investigative documents
The system is presented as a new end-to-end solution combining the listed components.

pith-pipeline@v0.9.0 · 5614 in / 1412 out tokens · 45153 ms · 2026-05-10T18:33:10.901785+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "dual-path document-to-schema pipeline... schema-first harmonization and validation... LLM-assisted extraction pathway"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "F1 = 0.8664 vs. 0.2578 on 75-case gold-aligned subset"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] A. Bielska, N. R. Kurz, Y. Baumgartner, and V. Benetis, Open Source Intelligence Tools and Resources Handbook, 2020th ed. i-intelligence, 2020.

[2] D. Mider, “Open source intelligence on the internet – categorisation and evaluation of search tools,” Internal Security Review, vol. 31, pp. 383–412, 2024.

[3] K. M. A. Solaiman, T. Sun, A. Nesen, B. Bhargava, and M. Stonebraker, “Applying machine learning and data fusion to the ‘missing person’ problem,” IEEE Computer, vol. 55, no. 6, pp. 40–55, 2022.

[4] J. Ruiz Reyes, D. Congram, R. A. Sirbu, and L. Floridi, “Where are they? a review of statistical techniques and data analysis to support the search for missing persons,” Forensic Science International, vol. 376, p. 112582, 2025.

[5] M. Chau, J. J. Xu, and H. Chen, “Extracting meaningful entities from police narrative reports,” Journal of the American Society for Information Science and Technology, vol. 53, no. 11, pp. 984–995, 2002.

[6] P. A. Longley, M. F. Goodchild, D. J. Maguire, and D. W. Rhind, Geographic Information Science and Systems, 4th ed. Wiley, 2015.

[7] R. Besenczi, N. Bátfai, P. Jeszenszky, R. Major, F. Monori, and M. Ispány, “Large-scale simulation of traffic flow using markov model,” PLOS ONE, vol. 16, no. 2, p. e0246062, 2021.

[8] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O’Reilly Media, 2009.

[9] S. V. Mahadevkar, S. Patil, K. Kotecha, L. W. Soong, and T. Choudhury, “Exploring ai-driven approaches for unstructured document analysis and future horizons,” Journal of Big Data, vol. 11, p. 92, 2024.

[10] P. M. Barone, R. M. Di Maggio, and S. Mesturini, “Materials for the study of the locus operandi in the search for missing persons in italy,” Forensic Sciences Research, vol. 7, no. 3, pp. 371–377, 2022.

[11] D. Congram, M. W. Kenyhercz, and A. G. Green, “Grave mapping in support of the search for missing persons in conflict contexts,” Forensic Science International, vol. 278, pp. 260–268, 2017.

[12] A. Hashimoto, L. Heintzman, R. Koester, and N. Abaid, “An agent-based model reveals lost person behavior based on data from wilderness search and rescue,” Scientific Reports, vol. 12, p. 5873, 2022.

[13] R. Chen, C. Qin, W. Jiang, and D. Choi, “Is a large language model a good annotator for event extraction?” in Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24). AAAI, 2024, pp. 17772–17780.

[14] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré, “Snorkel: Rapid training data creation with weak supervision,” Proceedings of the VLDB Endowment, vol. 11, no. 3, pp. 269–282, 2017.

[15] T. Li, Y. Qin, and O. R. L. Sheng, “A multi-task evaluation of LLMs’ processing of academic text input,” 2025, arXiv:2508.11779.

[16] Z. Zeng, W. Ni, T. Fang, X. Li, X. Zhao, and Y. Song, “Weakly supervised text classification using supervision signals from a language model,” in Findings of the Association for Computational Linguistics: NAACL 2022, 2022, pp. 2295–2305.

[17] X. Chen, L. Li, N. Zhang et al., “Retrieval-augmented prompt learning,” in Advances in Neural Information Processing Systems, 2023, arXiv:2205.14704.

[18] L. Floridi and J. Cowls, “A unified framework of five principles for AI in society,” Harvard Data Science Review, vol. 1, no. 1, 2019.

[19] International Committee of the Red Cross, “Balancing risks and opportunities: New technologies and the search for missing people,” ICRC, Tech. Rep., 2025.
