arxiv: 2604.16425 · v1 · submitted 2026-04-04 · 💻 cs.DB · cs.LG

Method for Aggregating Unstructured Data Using Large Language Models

Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3

keywords unstructured data · large language models · hallucination mitigation · web scraping · JSON schema · data aggregation · embedding verification

The pith

Large language models can turn unstructured web content into reliable JSON records that follow a fixed schema, by scraping pages and verifying outputs through embedding comparisons across temperature settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an automated pipeline that scrapes both static and dynamic web pages, stores raw content in MongoDB, and uses large language models to map the text into a fixed JSON schema. To counter hallucinations, the method runs the LLM multiple times at different temperature values, converts the outputs to embeddings, and selects only those results that show high similarity while also applying explicit consistency rules. Experiments indicate that key fields are populated accurately and that the system continues to work when page layouts change, making it practical for ongoing tasks such as news aggregation and monitoring. The approach removes the need for manual preprocessing code tailored to each source.
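The normalization step can be pictured with a minimal sketch, assuming a hypothetical four-field target schema. The field names, the `Article` class, and the `normalize` helper are illustrative only; the source does not list the actual schema or parsing code.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical target schema: these field names are assumptions for illustration.
@dataclass
class Article:
    title: str
    published_at: str
    source_url: str
    summary: str

REQUIRED = ("title", "published_at", "source_url", "summary")

def normalize(llm_output: str) -> Optional[Article]:
    """Parse raw LLM text into the fixed schema, or return None if it does not fit."""
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Every required field must be a non-empty string.
    for key in REQUIRED:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
    # Keys outside the schema are dropped so the record matches it exactly.
    return Article(**{key: data[key].strip() for key in REQUIRED})
```

Rejecting malformed output at this point keeps the later verification stages focused on content errors rather than on parsing failures.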

Core claim

The central claim is that a hybrid scraping layer combined with LLM-driven normalization into a predetermined JSON schema, followed by a two-stage verification that compares embeddings of outputs generated at different temperature parameters and applies formalized integrity rules, produces accurate structured data from unstructured web sources with robustness to page-structure changes.
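The "formalized integrity rules" are not spelled out in the text available here. As a hedged sketch, rules of this kind might check mandatory fields and value formats; the field names and the three rules below are assumptions, not the authors' rule set.

```python
from datetime import date
from urllib.parse import urlparse

def check_rules(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    violations = []
    # Rule 1 (assumed): mandatory fields must be present and non-empty.
    for field in ("title", "published_at", "source_url"):
        if not record.get(field):
            violations.append(f"missing:{field}")
    # Rule 2 (assumed): dates must be ISO 8601 calendar dates.
    published = record.get("published_at")
    if published:
        try:
            date.fromisoformat(published)
        except (TypeError, ValueError):
            violations.append("format:published_at")
    # Rule 3 (assumed): the source must be an absolute http(s) URL.
    url = record.get("source_url")
    if url:
        parsed = urlparse(url) if isinstance(url, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            violations.append("format:source_url")
    return violations
```

Returning the list of violations, rather than a bare pass/fail flag, makes it easier to log which rule rejected a record.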

What carries the argument

The two-stage verification process that generates multiple LLM outputs at different temperatures, computes their embeddings, and retains only consistent results while enforcing additional consistency rules.
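A minimal sketch of the first verification stage, under stated assumptions: `generate` is a hypothetical callable wrapping the LLM, `toy_embed` is a bag-of-words stand-in for a real embedding model, and the temperatures and the 0.8 threshold are illustrative values that do not come from the paper.

```python
import math
import re
from collections import Counter
from itertools import combinations

def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (not the paper's model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def consistent_output(generate, temperatures=(0.0, 0.5, 1.0), threshold=0.8, embed=toy_embed):
    """Query the model once per temperature; keep the first output only if all pairs agree."""
    outputs = [generate(t) for t in temperatures]
    vectors = [embed(o) for o in outputs]
    worst = min(cosine(x, y) for x, y in combinations(vectors, 2))
    return outputs[0] if worst >= threshold else None
```

Using the minimum pairwise similarity is one possible aggregation; a mean or a majority vote would be equally plausible readings of the description.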

If this is right

  • The pipeline can ingest data from many changing web sources without per-site preprocessing code.
  • Accuracy on key fields remains high enough for near real-time news and log aggregation.
  • Data can be stored directly in a non-relational database for immediate downstream use.
  • The same verification technique could be reused for other LLM-based extraction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding similarity across temperature settings might serve as a lightweight check for other LLM data-pipeline applications.
  • The method could be extended to combine outputs from several different LLMs rather than temperature variations of a single model.
  • Storing both raw pages and verified JSON in the same database opens the possibility of automated re-verification when new rules are added.
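The re-verification idea in the last bullet can be sketched as follows. A plain Python list stands in for the database collection, and the document layout (`_id`, `raw`, `record`, `verified`) and the rule signature are assumptions made for illustration.

```python
def reverify(store, rules):
    """Re-run every rule over stored documents and report the ones that now fail.

    store: list of dicts with keys "_id", "raw", "record", "verified" (assumed layout).
    rules: mapping of rule name to a callable (raw_text, record) -> bool.
    """
    flagged = []
    for doc in store:
        failed = [name for name, rule in rules.items() if not rule(doc["raw"], doc["record"])]
        doc["verified"] = not failed  # update the flag in place
        if failed:
            flagged.append((doc["_id"], failed))
    return flagged
```

Because the raw page is kept next to the structured record, a rule added later can be applied without scraping or querying the LLM again.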

Load-bearing premise

Comparing embeddings of LLM outputs produced at different temperatures is sufficient to detect and remove hallucinations when mapping arbitrary web text to a target JSON schema.
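One way to state this premise formally, using notation that is assumed here and does not appear in the source: let y_1, …, y_k be the outputs produced at k temperature settings (k = 3 in the description under Figure 1), E an embedding model, and τ a similarity threshold. Taking the minimum over pairs is itself an assumption about how the similarities are aggregated.

```latex
\[
\operatorname{accept}(y_1,\dots,y_k)
\iff
\min_{1 \le i < j \le k}
\frac{\langle e_i, e_j \rangle}{\lVert e_i \rVert \, \lVert e_j \rVert} \ge \tau,
\qquad e_i = E(y_i).
\]
```

The criterion constrains only the agreement of the outputs with one another. The source text never enters the expression, so the premise amounts to assuming that agreement among the outputs implies agreement with the page.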

What would settle it

A collection of web pages where an LLM produces incorrect JSON values that nevertheless yield nearly identical embeddings across several temperature settings, causing the verification step to accept the errors.
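A small harness for such a test could look like the sketch below. It uses exact string agreement as the limiting case of embedding similarity, since identical outputs yield identical embeddings under any deterministic embedding model. The case format and the hand-labelled `truth` records are hypothetical.

```python
import json

def accepted_errors(cases, temperatures=(0.0, 0.5, 1.0)):
    """Return ids of cases whose outputs agree across temperatures yet contradict the label.

    cases: iterable of (case_id, llm, truth), where llm(temperature) returns a JSON
    string and truth is the hand-labelled record for that page (assumed format).
    """
    bad = []
    for case_id, llm, truth in cases:
        outputs = [llm(t) for t in temperatures]
        # Identical outputs would pass any consistency-based check.
        if len(set(outputs)) == 1 and json.loads(outputs[0]) != truth:
            bad.append(case_id)
    return bad
```

A non-empty result on real pages would be direct evidence that the verification step can accept errors.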

Figures

Figures reproduced from arXiv: 2604.16425 by Dmitriy Fedorov, Maria Shabarina, Natalia Tereshkina, Vsevolod Lazebnyi.

Figure 1: DFD schema of the method.
Step 4. Multi-level validation and quality assurance. To enhance reliability, a dual-stage hallucination check is used: at the first stage, a triple query to LLM with various temperatures to generate three similar texts, followed by a semantic contrast of embeddings to detect hallucinations of essential fields, as well as a formal rule check (precision of formats, mandatory fields…

Figure 2: IDEF0 of method.
… among the system’s core modules, capturing the end-to-end progression from raw data acquisition to the generation of the final structured response exposed via the API. The process begins with the “Collect Data” stage, where the system acquires news content from predefined or dynamically added sources (Sources Addition). The retrieved unstructured data are then persisted in the database (Da…

Figure 3: IDEF1 of method.
… verify this, we used the BERT model (bert-base-multilingual-uncased) to compare the collected news articles with the summaries generated from them using key phrases. The average similarity between the two was 87%, based on a sample of 1600 news articles and generated digests. This indicates that the semantic content was not significantly lost during the process

Figure 4: BERT similarity between raw text and aggregated text.

Original abstract

This paper presents a method for the automated collection and aggregation of unstructured data from diverse web sources, utilizing Large Language Models (LLMs). The primary challenge with existing techniques is their instability when the structure of webpages changes, their limited support for dynamically loaded content during information collection, and the requirement for labor-intensive manual design of data pre-processing processes. The proposed algorithm integrates hybrid web scraping (Goose3 for static pages and Selenium+WebDriver for dynamic ones), data storage in a non-relational MongoDB database management system (DBMS), and intelligent extraction and normalization of information using LLMs into a predetermined JSON schema. A key scientific contribution of this study is a two-stage verification process for the generated data, designed to eliminate potential hallucinations byy comparing the embeddings of multiple LLM outputs obtained with different temperature parameter values, combined with formalized rules for monitoring data consistency and integrity. The experimental findings indicate a high level of accuracy in the completion of key fields, as well as the robustness of the proposed methodology to changes in web page structures. This makes it suitable for use in tasks such as news content aggregation, monitoring, and log analysis in near real-time mode, with the capacity to scale rapidly in terms of the number of sources.
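The hybrid scraping layer can be sketched as a simple fallback dispatcher. The two fetchers are injected as callables, so the sketch does not depend on any particular library: in practice `static_fetch` might wrap Goose3 and `dynamic_fetch` a Selenium WebDriver session. The length-based fallback heuristic and the `min_chars` value are assumptions; the abstract does not say how pages are routed.

```python
def scrape(url, static_fetch, dynamic_fetch, min_chars=200):
    """Try the cheap static extractor first; fall back to browser rendering.

    static_fetch and dynamic_fetch are callables taking a URL and returning text.
    min_chars is an assumed threshold for "the static extractor found real content".
    """
    try:
        text = static_fetch(url)
    except Exception:
        text = ""  # treat a failed static fetch like an empty page
    if len(text.strip()) >= min_chars:
        return {"url": url, "text": text, "mode": "static"}
    # Too little text usually means the content is loaded by JavaScript.
    return {"url": url, "text": dynamic_fetch(url), "mode": "dynamic"}
```

Recording the `mode` alongside the raw text makes it possible to see later which sources require the more expensive dynamic path.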

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a method for automated aggregation of unstructured web data using LLMs. It integrates hybrid scraping (Goose3 for static content and Selenium for dynamic), MongoDB storage, and LLM-based extraction/normalization into a target JSON schema. The central contribution is a two-stage verification process that generates multiple LLM outputs at different temperature values, compares their embeddings to detect hallucinations, and applies formalized consistency/integrity rules. The abstract reports that experiments demonstrate high accuracy in key fields and robustness to webpage structure changes, making the approach suitable for near-real-time tasks such as news aggregation and monitoring.

Significance. If the verification procedure and extraction accuracy hold, the work could offer a practical, scalable alternative to brittle rule-based scraping pipelines for dynamic web sources. The hybrid scraping plus LLM normalization addresses a real engineering pain point in data collection. However, the absence of any quantitative metrics, baselines, dataset descriptions, or ground-truth comparisons in the provided text substantially limits the ability to assess whether the claimed accuracy and hallucination elimination are achieved.

major comments (2)
  1. [Abstract] Abstract: the claim that experiments show 'a high level of accuracy in the completion of key fields' is unsupported because no quantitative metrics, baselines, error bars, dataset sizes, or exclusion criteria are supplied, preventing verification of the data-to-claim link.
  2. [Verification Process] Description of the two-stage verification process: comparing embeddings of outputs generated at different temperatures measures inter-sample consistency but does not cross-check extracted fields against the original webpage text or any external ground truth; therefore consistent but factually incorrect extractions (e.g., fabricated values that fit the JSON schema) can pass verification.
minor comments (1)
  1. [Abstract] Abstract contains the typo 'byy comparing' (should be 'by comparing').
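The kind of evidence requested in major comment 1 could be reported with a field-level metric such as the one sketched here. The exact-match criterion, the record layout, and the function name are assumptions; the paper's own evaluation protocol is not described in the text available here.

```python
def field_scores(predicted, gold, fields):
    """Micro-averaged precision and recall over (record, field) pairs, exact match.

    predicted, gold: parallel lists of dicts; a missing key or None means "not filled".
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        for field in fields:
            p, g = pred.get(field), ref.get(field)
            if p is not None and p == g:
                tp += 1
            else:
                if p is not None:
                    fp += 1  # a value was filled in but is wrong or spurious
                if g is not None:
                    fn += 1  # the gold value was not recovered
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Reporting both numbers together with the number of test pages would make the "high level of accuracy" claim checkable.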

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and noting revisions where the manuscript will be updated to strengthen the presentation of results and limitations.

  1. Referee: [Abstract] Abstract: the claim that experiments show 'a high level of accuracy in the completion of key fields' is unsupported because no quantitative metrics, baselines, error bars, dataset sizes, or exclusion criteria are supplied, preventing verification of the data-to-claim link.

    Authors: We agree that the abstract claim requires explicit supporting evidence to be verifiable. The full manuscript contains an experiments section with evaluation details, but to directly address the concern we have revised the abstract to incorporate specific quantitative metrics (e.g., precision/recall on key fields), the number of test webpages, and a brief description of the evaluation protocol and exclusion criteria. These additions make the data-to-claim linkage explicit without altering the original findings.
    Revision: yes

  2. Referee: [Verification Process] Description of the two-stage verification process: comparing embeddings of outputs generated at different temperatures measures inter-sample consistency but does not cross-check extracted fields against the original webpage text or any external ground truth; therefore consistent but factually incorrect extractions (e.g., fabricated values that fit the JSON schema) can pass verification.

    Authors: The referee correctly notes that our verification relies on inter-sample consistency (via embedding comparison across temperature settings) plus formalized integrity rules, rather than direct grounding against the source webpage text or external ground truth. This design detects many hallucinations through inconsistency but cannot rule out coherent yet fabricated values. We have revised the manuscript to explicitly acknowledge this limitation in the verification-process section, to clarify the method's scope as a practical heuristic rather than a complete factual verifier, and to outline planned extensions such as source-text alignment checks.
    Revision: partial
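A source-text alignment check of the kind mentioned in the second response might, in its simplest form, test whether each extracted value occurs verbatim in the scraped page. This is a sketch under that assumption; the function names and the whitespace and case normalization are illustrative, and the rebuttal does not describe the planned extension in detail.

```python
import re

def _norm(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()

def ungrounded_fields(record: dict, source_text: str, fields) -> list[str]:
    """Return the fields whose value is empty or cannot be found in the source text."""
    haystack = _norm(source_text)
    missing = []
    for field in fields:
        value = record.get(field)
        if not value or _norm(str(value)) not in haystack:
            missing.append(field)
    return missing
```

Verbatim matching only suits fields copied from the page. Values the LLM normalizes, such as reformatted dates, would need a field-specific comparison instead.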

Circularity Check

0 steps flagged

No circularity; purely procedural description without derivations or self-referential reductions

The paper presents a procedural algorithm for web scraping, MongoDB storage, LLM-based JSON extraction, and a two-stage verification step that compares embeddings of outputs generated at different temperatures plus consistency rules. No equations, fitted parameters, derivations, or mathematical claims appear anywhere in the provided text. The verification process is described directly as a design choice rather than derived from prior results or reduced to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner. The method is therefore self-contained as an engineering description and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on domain assumptions about LLM extraction reliability and the effectiveness of embedding comparison for hallucination detection, with no free parameters or new entities explicitly fitted or postulated in the abstract.

axioms (2)
  • domain assumption: LLMs can extract and normalize information from unstructured web text into a predetermined JSON schema when given appropriate prompts.
    Invoked in the intelligent extraction step of the algorithm.
  • ad hoc to paper: Semantic embeddings of multiple LLM outputs generated at different temperatures can be compared to detect and eliminate hallucinations.
    Core mechanism of the two-stage verification process.
