Method for Aggregating Unstructured Data Using Large Language Models

Dmitriy Fedorov; Maria Shabarina; Natalia Tereshkina; Vsevolod Lazebnyi

arxiv: 2604.16425 · v1 · submitted 2026-04-04 · 💻 cs.DB · cs.LG

Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3

keywords: unstructured data, large language models, hallucination mitigation, web scraping, JSON schema, data aggregation, embedding verification

The pith

Large language models can turn unstructured web content into reliable JSON schemas by scraping pages and verifying outputs through embedding comparisons across temperature settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an automated pipeline that scrapes both static and dynamic web pages, stores raw content in MongoDB, and uses large language models to map the text into a fixed JSON schema. To counter hallucinations, the method runs the LLM multiple times at different temperature values, converts the outputs to embeddings, and selects only those results that show high similarity while also applying explicit consistency rules. Experiments indicate that key fields are populated accurately and that the system continues to work when page layouts change, making it practical for ongoing tasks such as news aggregation and monitoring. The approach removes the need for manual preprocessing code tailored to each source.
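The hybrid scraping layer can be pictured as a cheap static attempt with a browser fallback. The sketch below is a minimal illustration under stated assumptions, not the authors' code: the summary does not say how the pipeline chooses a route, so the `min_chars` fallback rule, the function names, and the stand-in fetchers are all hypothetical. In a real pipeline `static_fetch` would wrap Goose3 and `dynamic_fetch` would wrap Selenium, the two tools named in the abstract.

```python
from typing import Callable

def fetch_text(url: str,
               static_fetch: Callable[[str], str],
               dynamic_fetch: Callable[[str], str],
               min_chars: int = 200) -> tuple[str, str]:
    """Return (text, route). Try the static extractor first and fall back
    to the browser-based one when it fails or returns too little text.
    The min_chars heuristic is an assumption, not taken from the paper."""
    try:
        text = static_fetch(url)
    except Exception:
        text = ""
    if len(text.strip()) >= min_chars:
        return text, "static"
    return dynamic_fetch(url), "dynamic"

# Hypothetical stand-ins so the sketch runs without network access.
def fake_static(url: str) -> str:
    # Pretend single-page apps return no text to a static extractor.
    return "" if "spa" in url else "Static article body. " * 20

def fake_dynamic(url: str) -> str:
    return "Rendered article body. " * 20

print(fetch_text("https://example.com/news/1", fake_static, fake_dynamic)[1])
print(fetch_text("https://example.com/spa/1", fake_static, fake_dynamic)[1])
```

Passing the fetchers in as callables keeps the routing logic testable on its own and leaves the choice of scraping library open.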

Core claim

The central claim is that a hybrid scraping layer combined with LLM-driven normalization into a predetermined JSON schema, followed by a two-stage verification that compares embeddings of outputs generated at different temperature parameters and applies formalized integrity rules, produces accurate structured data from unstructured web sources with robustness to page-structure changes.
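The formalized integrity rules are not spelled out in this summary; the Figure 1 caption mentions only format precision and mandatory fields. As a rough sketch of what such a rule check might look like, the example below validates a record against a hypothetical schema. The field names (`title`, `published_at`, `url`) and the specific rules are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical mandatory fields; the paper's actual schema is not given here.
REQUIRED = ("title", "published_at", "url")

def check_record(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    # Rule 1: every mandatory field is a non-empty string.
    for field in REQUIRED:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # Rule 2: the publication date parses as an ISO 8601 date.
    published = record.get("published_at")
    if isinstance(published, str) and published.strip():
        try:
            datetime.fromisoformat(published)
        except ValueError:
            problems.append("published_at is not an ISO 8601 date")
    # Rule 3: the URL uses an http(s) scheme.
    url = record.get("url")
    if isinstance(url, str) and url.strip() and not url.startswith(("http://", "https://")):
        problems.append("url does not start with http:// or https://")
    return problems

good = {"title": "Budget approved", "published_at": "2026-04-04", "url": "https://example.com/a"}
bad = {"title": "", "published_at": "04/04/2026", "url": "ftp://example.com/a"}
print(check_record(good))
print(check_record(bad))
```

Rules of this kind catch malformed output but say nothing about whether a well-formed value is true, which is why the paper pairs them with the embedding comparison.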

What carries the argument

The two-stage verification process that generates multiple LLM outputs at different temperatures, computes their embeddings, and retains only consistent results while enforcing additional consistency rules.
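To make the consistency stage concrete, here is a minimal sketch assuming three outputs from different temperature settings and a pairwise cosine-similarity threshold. The embedding model and the threshold are not given in this summary, so `toy_embed` (a bag-of-words counter standing in for a real encoder) and the value `0.8` are illustrative assumptions only.

```python
import math
import re
from collections import Counter
from itertools import combinations

def toy_embed(text: str) -> Counter:
    """Stand-in embedding: token counts. A real pipeline would use a sentence encoder."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def consistent(outputs: list[str], threshold: float = 0.8, embed=toy_embed) -> bool:
    """Accept only if every pair of outputs is at least `threshold` similar."""
    vectors = [embed(o) for o in outputs]
    return all(cosine(x, y) >= threshold for x, y in combinations(vectors, 2))

# Hypothetical outputs from three temperature settings.
stable = [
    "title: Central bank holds rates; date: 2026-04-04",
    "title: Central bank holds rates; date: 2026-04-04",
    "title: Central bank holds interest rates; date: 2026-04-04",
]
drifting = stable[:2] + ["title: Stock market falls sharply; date: 2025-01-15"]
print(consistent(stable), consistent(drifting))
```

Requiring every pair to clear the threshold is the strictest reading of "retains only consistent results"; a mean-similarity rule would be a looser alternative.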

If this is right

The pipeline can ingest data from many changing web sources without per-site preprocessing code.
Accuracy on key fields remains high enough for near real-time news and log aggregation.
Data can be stored directly in a non-relational database for immediate downstream use.
The same verification technique could be reused for other LLM-based extraction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Embedding similarity across temperature settings might serve as a lightweight check for other LLM data-pipeline applications.
The method could be extended to combine outputs from multiple different LLM models rather than temperature variations of one model.
Storing both raw pages and verified JSON in the same database opens the possibility of automated re-verification when new rules are added.
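One way the multi-model extension might be realized is a per-field majority vote over the records returned by different models. This is a sketch of that editorial idea, not something the paper describes; the function name, the `min_agree` parameter, and the sample records are hypothetical.

```python
from collections import Counter

def vote(records: list[dict], min_agree: int = 2) -> dict:
    """Per field, keep the most common value if at least `min_agree` records
    share it; otherwise set the field to None. Values must be hashable."""
    fields = {key for record in records for key in record}
    merged = {}
    for field in sorted(fields):
        counts = Counter(record[field] for record in records if field in record)
        value, n = counts.most_common(1)[0]
        merged[field] = value if n >= min_agree else None
    return merged

# Hypothetical outputs from three different models for the same page.
records = [
    {"title": "Budget approved", "date": "2026-04-04"},
    {"title": "Budget approved", "date": "2026-04-05"},
    {"title": "Budget approved", "date": "2026-04-06"},
]
print(vote(records))
```

Voting field by field keeps the values the models agree on and blanks only the disputed ones, instead of rejecting the whole record.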

Load-bearing premise

Comparing embeddings of LLM outputs produced at different temperatures is sufficient to detect and remove hallucinations when mapping arbitrary web text to a target JSON schema.

What would settle it

A collection of web pages where an LLM produces incorrect JSON values that nevertheless yield nearly identical embeddings across several temperature settings, causing the verification step to accept the errors.
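Such a test could be scored as a false-accept rate: among cases that pass the agreement check, how many disagree with a hand-labelled reference. The sketch below assumes a small labelled set and uses `difflib.SequenceMatcher` as a crude stand-in for embedding similarity; the threshold, the sample cases, and the function names are illustrative assumptions.

```python
import json
from difflib import SequenceMatcher
from itertools import combinations

def agreement(outputs: list[str]) -> float:
    """Lowest pairwise string similarity across the outputs."""
    return min(SequenceMatcher(None, a, b).ratio() for a, b in combinations(outputs, 2))

def false_accept_rate(cases, threshold: float = 0.9) -> float:
    """Share of accepted cases whose first output differs from the labelled truth."""
    accepted = [(outs, truth) for outs, truth in cases if agreement(outs) >= threshold]
    if not accepted:
        return 0.0
    wrong = sum(json.loads(outs[0]) != truth for outs, truth in accepted)
    return wrong / len(accepted)

truth = {"date": "2026-03-03"}                 # hand-labelled reference value
right = json.dumps({"date": "2026-03-03"})
wrong = json.dumps({"date": "2026-03-13"})
other = json.dumps({"date": "2025-11-30"})
cases = [
    ([right, right, right], truth),   # consistent and correct: accepted
    ([wrong, wrong, wrong], truth),   # consistent but wrong: accepted
    ([right, wrong, other], truth),   # inconsistent: rejected
]
print(false_accept_rate(cases))
```

In this toy set the consistently wrong case passes the agreement check, so half of the accepted cases are errors. A non-trivial rate on real pages would count against the load-bearing premise.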

Figures

Figures reproduced from arXiv: 2604.16425 by Dmitriy Fedorov, Maria Shabarina, Natalia Tereshkina, Vsevolod Lazebnyi.

**Figure 1.** DFD schema of the method. Step 4. Multi-level validation and quality assurance. To enhance reliability, a dual-stage hallucination check is used: at the first stage, a triple query to LLM with various temperatures to generate three similar texts, followed by a semantic contrast of embeddings to detect hallucinations of essential fields, as well as a formal rule check (precision of formats, mandatory fields…

**Figure 2.** IDEF0 of method. among the system’s core modules, capturing the end-to-end progression from raw data acquisition to the generation of the final structured response exposed via the API. The process begins with the “Collect Data” stage, where the system acquires news content from predefined or dynamically added sources (Sources Addition). The retrieved unstructured data are then persisted in the database (Da…

**Figure 3.** IDEF1 of method. verify this, we used the BERT model (bert-base-multilingual-uncased) to compare the collected news articles with the summaries generated from them using key phrases. The average similarity between the two was 87%, based on a sample of 1600 news articles and generated digests. This indicates that the semantic content was not significantly lost during the process …

**Figure 4.** BERT Similarity between raw text and aggregated text.

Original abstract

This paper presents a method for the automated collection and aggregation of unstructured data from diverse web sources, utilizing Large Language Models (LLMs). The primary challenge with existing techniques is their instability when the structure of webpages changes, their limited support for dynamically loaded content during information collection, and the requirement for labor-intensive manual design of data pre-processing processes. The proposed algorithm integrates hybrid web scraping (Goose3 for static pages and Selenium+WebDriver for dynamic ones), data storage in a non-relational MongoDB database management system (DBMS), and intelligent extraction and normalization of information using LLMs into a predetermined JSON schema. A key scientific contribution of this study is a two-stage verification process for the generated data, designed to eliminate potential hallucinations byy comparing the embeddings of multiple LLM outputs obtained with different temperature parameter values, combined with formalized rules for monitoring data consistency and integrity. The experimental findings indicate a high level of accuracy in the completion of key fields, as well as the robustness of the proposed methodology to changes in web page structures. This makes it suitable for use in tasks such as news content aggregation, monitoring, and log analysis in near real-time mode, with the capacity to scale rapidly in terms of the number of sources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A practical scraping-plus-LLM pipeline whose embedding consistency check only measures agreement, not correctness against the source.

The paper's main contribution is a complete workflow that scrapes web pages with Goose3 for static content and Selenium for dynamic pages, stores the raw results in MongoDB, and uses LLMs to extract and normalize the data into a fixed JSON schema. The distinctive piece is the two-stage verification: generate several outputs at different temperatures, compare their embeddings, and apply rules to flag inconsistencies. This is presented as a way to reduce hallucinations without extra manual work.

The approach is straightforward and directly targets real engineering headaches like page structure changes and the cost of custom preprocessing for each source. For tasks such as news aggregation or near-real-time monitoring, the end-to-end design could be useful in practice because it avoids brittle selectors and scales by adding more sources. The description of the pipeline is clear and shows honest attention to the operational constraints of web data collection.

The main weakness is the evaluation. The abstract claims high accuracy and robustness, yet supplies no numbers, baselines, datasets, or methodology details, so the results cannot be checked. More critically, the verification step only tests whether outputs agree with each other across temperatures. If the LLM produces the same incorrect extraction consistently, the embeddings will match and the output will pass. No step is described that compares the extracted fields back to the original page text or any ground truth. This leaves the central claim about eliminating hallucinations untested.

The work is aimed at practitioners building data pipelines who need something that mostly works on unstable sources. Readers looking for new algorithms, formal proofs, or rigorous LLM benchmarks will not find them here. It deserves peer review once the authors add concrete metrics and test whether the consistency check actually improves accuracy; without that evidence the paper is too light to assess properly.

Referee Report

2 major / 1 minor

Summary. The paper presents a method for automated aggregation of unstructured web data using LLMs. It integrates hybrid scraping (Goose3 for static content and Selenium for dynamic), MongoDB storage, and LLM-based extraction/normalization into a target JSON schema. The central contribution is a two-stage verification process that generates multiple LLM outputs at different temperature values, compares their embeddings to detect hallucinations, and applies formalized consistency/integrity rules. The abstract reports that experiments demonstrate high accuracy in key fields and robustness to webpage structure changes, making the approach suitable for near-real-time tasks such as news aggregation and monitoring.

Significance. If the verification procedure and extraction accuracy hold, the work could offer a practical, scalable alternative to brittle rule-based scraping pipelines for dynamic web sources. The hybrid scraping plus LLM normalization addresses a real engineering pain point in data collection. However, the absence of any quantitative metrics, baselines, dataset descriptions, or ground-truth comparisons in the provided text substantially limits the ability to assess whether the claimed accuracy and hallucination elimination are achieved.

major comments (2)

[Abstract] Abstract: the claim that experiments show 'a high level of accuracy in the completion of key fields' is unsupported because no quantitative metrics, baselines, error bars, dataset sizes, or exclusion criteria are supplied, preventing verification of the data-to-claim link.
[Verification Process] Description of the two-stage verification process: comparing embeddings of outputs generated at different temperatures measures inter-sample consistency but does not cross-check extracted fields against the original webpage text or any external ground truth; therefore consistent but factually incorrect extractions (e.g., fabricated values that fit the JSON schema) can pass verification.

minor comments (1)

[Abstract] Abstract contains the typo 'byy comparing' (should be 'by comparing').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and noting revisions where the manuscript will be updated to strengthen the presentation of results and limitations.

Referee: [Abstract] Abstract: the claim that experiments show 'a high level of accuracy in the completion of key fields' is unsupported because no quantitative metrics, baselines, error bars, dataset sizes, or exclusion criteria are supplied, preventing verification of the data-to-claim link.

Authors: We agree that the abstract claim requires explicit supporting evidence to be verifiable. The full manuscript contains an experiments section with evaluation details, but to directly address the concern we have revised the abstract to incorporate specific quantitative metrics (e.g., precision/recall on key fields), the number of test webpages, and a brief description of the evaluation protocol and exclusion criteria. These additions make the data-to-claim linkage explicit without altering the original findings.

revision: yes

Referee: [Verification Process] Description of the two-stage verification process: comparing embeddings of outputs generated at different temperatures measures inter-sample consistency but does not cross-check extracted fields against the original webpage text or any external ground truth; therefore consistent but factually incorrect extractions (e.g., fabricated values that fit the JSON schema) can pass verification.

Authors: The referee correctly notes that our verification relies on inter-sample consistency (via embedding comparison across temperature settings) plus formalized integrity rules, rather than direct grounding against the source webpage text or external ground truth. This design detects many hallucinations through inconsistency but cannot rule out coherent yet fabricated values. We have revised the manuscript to explicitly acknowledge this limitation in the verification-process section, to clarify the method's scope as a practical heuristic rather than a complete factual verifier, and to outline planned extensions such as source-text alignment checks.

revision: partial
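A source-text alignment check of the kind mentioned in the rebuttal might, in its simplest form, flag extracted values that never occur in the scraped page. The sketch below is one possible reading under that assumption; the function names and the sample record are hypothetical, and the rebuttal does not specify how the planned check would work.

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()

def ungrounded_fields(record: dict, source_text: str) -> list[str]:
    """Names of string fields whose value does not occur verbatim in the source."""
    haystack = normalize(source_text)
    return [key for key, value in record.items()
            if isinstance(value, str) and normalize(value) not in haystack]

source = "Mayor Anna Ruiz opened the new bridge in Porto on Friday."
# Hypothetical extraction with one fabricated value ("Lisbon").
record = {"person": "Anna Ruiz", "city": "Lisbon", "object": "new  bridge"}
print(ungrounded_fields(record, source))
```

A verbatim match only suits fields copied from the page; normalized values such as reformatted dates would be flagged even when correct, so they would need a looser comparison.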

Circularity Check

0 steps flagged

No circularity; purely procedural description without derivations or self-referential reductions

The paper presents a procedural algorithm for web scraping, MongoDB storage, LLM-based JSON extraction, and a two-stage verification step that compares embeddings of outputs generated at different temperatures plus consistency rules. No equations, fitted parameters, derivations, or mathematical claims appear anywhere in the provided text. The verification process is described directly as a design choice rather than derived from prior results or reduced to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner. The method is therefore self-contained as an engineering description and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on domain assumptions about LLM extraction reliability and the effectiveness of embedding comparison for hallucination detection, with no free parameters or new entities explicitly fitted or postulated in the abstract.

axioms (2)

domain assumption: LLMs can extract and normalize information from unstructured web text into a predetermined JSON schema when given appropriate prompts.
Invoked in the intelligent extraction step of the algorithm.
ad hoc to paper: Semantic embeddings of multiple LLM outputs generated at different temperatures can be compared to detect and eliminate hallucinations.
Core mechanism of the two-stage verification process.
