Method for Aggregating Unstructured Data Using Large Language Models
Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3
The pith
Large language models can turn unstructured web content into reliable JSON records that follow a predetermined schema, by scraping pages and verifying the outputs through embedding comparisons across temperature settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that accurate structured data can be produced from unstructured web sources, with robustness to page-structure changes, by combining three components: a hybrid scraping layer, LLM-driven normalization into a predetermined JSON schema, and a two-stage verification. The verification first compares embeddings of outputs generated at different temperature settings and then applies formalized integrity rules.
What carries the argument
The two-stage verification process that generates multiple LLM outputs at different temperatures, computes their embeddings, and retains only consistent results while enforcing additional consistency rules.
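A minimal sketch of the consistency stage may make the mechanism concrete. It assumes the LLM outputs have already been collected at several temperatures. The `toy_embed` function is a bag-of-words stand-in for a real embedding model, and the function names, the 0.9 threshold, and the sample outputs are all illustrative assumptions; the embedding model, temperatures, and threshold used in the paper are not stated in the text reviewed here.

```python
import math
import re
from collections import Counter
from itertools import combinations


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words token counts (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_consistent(outputs: list[str], threshold: float = 0.9, embed=toy_embed) -> bool:
    """Accept only if every pair of outputs is at least `threshold` similar."""
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to compare")
    vectors = [embed(o) for o in outputs]
    return all(cosine(u, v) >= threshold for u, v in combinations(vectors, 2))


# Hypothetical outputs for one page, one string per temperature setting.
stable = ['{"title": "Rates held", "date": "2025-01-02"}'] * 3
drifting = [
    '{"title": "Rates held", "date": "2025-01-02"}',
    '{"title": "Markets rally on tech earnings", "date": "2024-11-30"}',
]

print(is_consistent(stable), is_consistent(drifting))
```

Requiring every pair to clear the threshold is the strictest choice; comparing each output against a single reference output would be a cheaper alternative. Either way, the check measures agreement between samples, not agreement with the source page.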
If this is right
- The pipeline can ingest data from many changing web sources without per-site preprocessing code.
- Accuracy on key fields remains high enough for near real-time news and log aggregation.
- Data can be stored directly in a non-relational database for immediate downstream use.
- The same verification technique could be reused for other LLM-based extraction tasks.
Where Pith is reading between the lines
- Embedding similarity across temperature settings might serve as a lightweight check for other LLM data-pipeline applications.
- The method could be extended to combine outputs from multiple different LLM models rather than temperature variations of one model.
- Storing both raw pages and verified JSON in the same database opens the possibility of automated re-verification when new rules are added.
Load-bearing premise
Comparing embeddings of LLM outputs produced at different temperatures is sufficient to detect and remove hallucinations when mapping arbitrary web text to a target JSON schema.
What would settle it
A collection of web pages where an LLM produces incorrect JSON values that nevertheless yield nearly identical embeddings across several temperature settings, causing the verification step to accept the errors.
Original abstract
This paper presents a method for the automated collection and aggregation of unstructured data from diverse web sources, utilizing Large Language Models (LLMs). The primary challenge with existing techniques is their instability when the structure of webpages changes, their limited support for dynamically loaded content during information collection, and the requirement for labor-intensive manual design of data pre-processing processes. The proposed algorithm integrates hybrid web scraping (Goose3 for static pages and Selenium+WebDriver for dynamic ones), data storage in a non-relational MongoDB database management system (DBMS), and intelligent extraction and normalization of information using LLMs into a predetermined JSON schema. A key scientific contribution of this study is a two-stage verification process for the generated data, designed to eliminate potential hallucinations byy comparing the embeddings of multiple LLM outputs obtained with different temperature parameter values, combined with formalized rules for monitoring data consistency and integrity. The experimental findings indicate a high level of accuracy in the completion of key fields, as well as the robustness of the proposed methodology to changes in web page structures. This makes it suitable for use in tasks such as news content aggregation, monitoring, and log analysis in near real-time mode, with the capacity to scale rapidly in terms of the number of sources.
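The abstract does not spell out the "formalized rules for monitoring data consistency and integrity". As a rough illustration only, rules of this kind could be expressed as checks on each extracted record. The field names in `REQUIRED_FIELDS` and the three rules below are hypothetical and are not taken from the paper.

```python
from datetime import datetime
from urllib.parse import urlparse

# Hypothetical target schema: every field is a non-empty string.
REQUIRED_FIELDS = ("title", "url", "published_at", "body")


def check_record(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []

    # Rule 1: every required field is present and non-blank.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty field: {field}")

    # Rule 2: the URL is absolute and uses http or https.
    url = record.get("url")
    if isinstance(url, str) and url.strip():
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append("url is not an absolute http(s) URL")

    # Rule 3: the publication date parses as an ISO 8601 date or datetime.
    published = record.get("published_at")
    if isinstance(published, str) and published.strip():
        try:
            datetime.fromisoformat(published)
        except ValueError:
            errors.append("published_at is not an ISO 8601 date")

    return errors
```

Returning the full list of violations, rather than stopping at the first one, makes it easier to log why a record was rejected before it reaches the database.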
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a method for automated aggregation of unstructured web data using LLMs. It integrates hybrid scraping (Goose3 for static content and Selenium for dynamic), MongoDB storage, and LLM-based extraction/normalization into a target JSON schema. The central contribution is a two-stage verification process that generates multiple LLM outputs at different temperature values, compares their embeddings to detect hallucinations, and applies formalized consistency/integrity rules. The abstract reports that experiments demonstrate high accuracy in key fields and robustness to webpage structure changes, making the approach suitable for near-real-time tasks such as news aggregation and monitoring.
Significance. If the verification procedure and extraction accuracy hold, the work could offer a practical, scalable alternative to brittle rule-based scraping pipelines for dynamic web sources. The hybrid scraping plus LLM normalization addresses a real engineering pain point in data collection. However, the absence of any quantitative metrics, baselines, dataset descriptions, or ground-truth comparisons in the provided text substantially limits the ability to assess whether the claimed accuracy and hallucination elimination are achieved.
major comments (2)
- [Abstract] Abstract: the claim that experiments show 'a high level of accuracy in the completion of key fields' is unsupported because no quantitative metrics, baselines, error bars, dataset sizes, or exclusion criteria are supplied, preventing verification of the data-to-claim link.
- [Verification Process] Description of the two-stage verification process: comparing embeddings of outputs generated at different temperatures measures inter-sample consistency but does not cross-check extracted fields against the original webpage text or any external ground truth; therefore consistent but factually incorrect extractions (e.g., fabricated values that fit the JSON schema) can pass verification.
minor comments (1)
- [Abstract] Abstract contains the typo 'byy comparing' (should be 'by comparing').
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and noting revisions where the manuscript will be updated to strengthen the presentation of results and limitations.
- Referee: [Abstract] Abstract: the claim that experiments show 'a high level of accuracy in the completion of key fields' is unsupported because no quantitative metrics, baselines, error bars, dataset sizes, or exclusion criteria are supplied, preventing verification of the data-to-claim link.
  Authors: We agree that the abstract claim requires explicit supporting evidence to be verifiable. The full manuscript contains an experiments section with evaluation details, but to directly address the concern we have revised the abstract to incorporate specific quantitative metrics (e.g., precision/recall on key fields), the number of test webpages, and a brief description of the evaluation protocol and exclusion criteria. These additions make the data-to-claim linkage explicit without altering the original findings.
  Revision: yes
- Referee: [Verification Process] Description of the two-stage verification process: comparing embeddings of outputs generated at different temperatures measures inter-sample consistency but does not cross-check extracted fields against the original webpage text or any external ground truth; therefore consistent but factually incorrect extractions (e.g., fabricated values that fit the JSON schema) can pass verification.
  Authors: The referee correctly notes that our verification relies on inter-sample consistency (via embedding comparison across temperature settings) plus formalized integrity rules, rather than direct grounding against the source webpage text or external ground truth. This design detects many hallucinations through inconsistency but cannot rule out coherent yet fabricated values. We have revised the manuscript to explicitly acknowledge this limitation in the verification-process section, to clarify the method's scope as a practical heuristic rather than a complete factual verifier, and to outline planned extensions such as source-text alignment checks.
  Revision: partial
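The exchange above turns on the gap between inter-sample consistency and grounding in the source page. One simple form a source-text alignment check could take is sketched below, under the assumption that extracted values are expected to appear verbatim in the page text. The function names and sample data are hypothetical, and this is not the extension the authors describe.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse each run of non-alphanumerics to one space (ASCII only)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ungrounded_fields(record: dict, source_text: str) -> list[str]:
    """Return the names of string fields whose value never appears in the source."""
    # Pad with spaces so matches align on whole tokens.
    haystack = f" {normalize(source_text)} "
    missing = []
    for field, value in record.items():
        if not isinstance(value, str):
            continue
        needle = normalize(value)
        if needle and f" {needle} " not in haystack:
            missing.append(field)
    return missing


# Hypothetical page text and extraction; "John Smith" is a fabricated author.
page = "Central bank holds rates. By Jane Doe, 2 January 2025. The bank kept its rate at 4.5%."
record = {"title": "Central bank holds rates", "author": "John Smith", "rate": "4.5%"}

print(ungrounded_fields(record, page))
```

A check like this would flag a fabricated value even when every temperature sample agrees on it. It would also flag legitimately normalized values, such as a date rewritten into ISO format, so in practice it would need per-field exemptions or fuzzier matching.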
Circularity Check
No circularity: the paper is a purely procedural description without derivations or self-referential reductions.
The paper presents a procedural algorithm for web scraping, MongoDB storage, LLM-based JSON extraction, and a two-stage verification step that compares embeddings of outputs generated at different temperatures plus consistency rules. No equations, fitted parameters, derivations, or mathematical claims appear anywhere in the provided text. The verification process is described directly as a design choice rather than derived from prior results or reduced to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner. The method is therefore self-contained as an engineering description and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can extract and normalize information from unstructured web text into a predetermined JSON schema when given appropriate prompts
- ad hoc to paper: Semantic embeddings of multiple LLM outputs generated at different temperatures can be compared to detect and eliminate hallucinations