You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models
Pith reviewed 2026-05-18 10:28 UTC · model grok-4.3
The pith
arXiv submissions expose thousands of personal details, credentials, and private links through unsanitized source files.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By processing more than 1.2 TB of source data from 100,000 arXiv submissions, the authors applied LaTeXpOsEd to detect thousands of leaks such as PII, GPS-tagged EXIF files, public Google Drive and Dropbox folders, editable SharePoint links, exposed GitHub and Google credentials, cloud API keys, confidential author communications, internal disagreements, and conference submission credentials.
What carries the argument
LaTeXpOsEd, a four-stage framework that combines pattern matching, logical filtering, traditional harvesting techniques, and large language models to identify hidden disclosures in non-referenced files and LaTeX comments.
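The pattern-matching stage over LaTeX comments can be pictured with a minimal sketch. The paper's actual patterns and filters are not given here, so `PATTERNS`, `extract_comment`, and `scan_comments` below are hypothetical names, and the three regular expressions are illustrative assumptions rather than the authors' rule set.

```python
import re

# Hypothetical, illustrative patterns; the paper's real rule set is not shown here.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "google_drive_link": re.compile(r"https://drive\.google\.com/\S+"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_comment(line):
    """Return the comment text of one LaTeX line, or None if it has no comment.

    A '%' starts a comment unless it is escaped, i.e. preceded by an odd
    number of consecutive backslashes (as in '50\\%').
    """
    backslashes = 0
    for i, ch in enumerate(line):
        if ch == "\\":
            backslashes += 1
            continue
        if ch == "%" and backslashes % 2 == 0:
            return line[i + 1:].strip()
        backslashes = 0
    return None


def scan_comments(tex):
    """Return (line_number, pattern_name, matched_text) for every hit in a comment."""
    hits = []
    for lineno, line in enumerate(tex.splitlines(), start=1):
        comment = extract_comment(line)
        if not comment:
            continue
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(comment):
                hits.append((lineno, name, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = (
        "Accuracy improves by 50\\% overall. % TODO: mail bob@example.org the draft\n"
        "% key = AKIAIOSFODNN7EXAMPLE\n"
        "\\section{Introduction}\n"
    )
    for hit in scan_comments(sample):
        print(hit)
```

The sketch ignores verbatim environments and multi-line constructs, so it is closer to a first-pass harvester than to a full framework; any hit would still need the later filtering and review stages.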
If this is right
- Repository operators must add automated sanitization steps for source files before public release.
- Authors should routinely scan their LaTeX submissions for embedded comments and auxiliary files containing credentials or private links.
- Conference systems and cloud services used by researchers become higher-value targets once preprint leaks are known to exist.
- Existing open-access policies may need explicit privacy review processes for non-PDF materials.
- Similar audits could be repeated on other preprint servers to map the full scope of exposure.
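For authors who want a quick pre-upload check, one simple heuristic is to list files in the submission bundle that no `.tex` source mentions, since non-referenced files are one of the leak vectors the paper highlights. The following is a rough sketch under stated assumptions: `unreferenced_files` is a hypothetical helper, and the substring test is deliberately crude (it does not parse `\input` or `\includegraphics` arguments).

```python
from pathlib import PurePosixPath


def unreferenced_files(tex_sources, all_files):
    """List bundle files whose path never appears in any .tex source.

    tex_sources: dict mapping .tex file name -> file content
    all_files:   list of every file path in the submission bundle
    A file counts as referenced if its path, with or without extension,
    occurs anywhere in the combined .tex text (a crude substring test).
    """
    combined = "\n".join(tex_sources.values())
    leftover = []
    for name in all_files:
        if name in tex_sources:
            continue  # the .tex sources themselves are handled separately
        stem_path = str(PurePosixPath(name).with_suffix(""))
        if name not in combined and stem_path not in combined:
            leftover.append(name)
    return leftover


if __name__ == "__main__":
    sources = {"main.tex": "\\includegraphics{figs/plot}\n\\input{intro}\n"}
    bundle = [
        "main.tex",
        "figs/plot.pdf",
        "intro.tex",
        "notes/reviewer_reply.txt",
        "figs/old_photo.jpg",
    ]
    print(unreferenced_files(sources, bundle))
```

Files flagged this way are only candidates for removal; a bibliography style or class file loaded indirectly could show up as a false alarm.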
Where Pith is reading between the lines
- The same detection approach could be adapted to scan institutional repositories or journal supplementary material for comparable leaks.
- Widespread awareness of these risks might shift author behavior toward stricter version control and comment removal before upload.
- Repository policies could evolve to treat source files as potentially sensitive rather than automatically public.
- If detection accuracy improves, platforms might offer optional pre-submission privacy scans as a service.
Load-bearing premise
The combination of pattern matching and LLM detection reliably flags sensitive information with low false positives across the varied set of 100,000 submissions.
What would settle it
Independent manual verification of a random sample of the reported leaks to check whether they match the claimed categories or contain only false positives.
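Such a check yields a precision estimate whose uncertainty depends on the sample size. As a hedged illustration, the sketch below computes a Wilson score interval for the fraction of sampled items that reviewers confirm as genuine leaks. The function name `wilson_interval` and the counts in the usage lines are made up for illustration and are not results from the paper.

```python
import math


def wilson_interval(confirmed, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    confirmed:   number of sampled items judged to be genuine leaks
    sample_size: number of sampled items that were manually reviewed
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = confirmed / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    centre = (p + z2 / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z2 / (4 * sample_size**2)) / denom
    )
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical audit: 90 of 100 sampled flags confirmed by a reviewer.
    low, high = wilson_interval(90, 100)
    print(f"precision estimate 0.90, interval ({low:.3f}, {high:.3f})")
```

A narrow interval near 1.0 would support the aggregate counts; a wide or low interval would suggest the totals overstate the number of real leaks.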
Original abstract
The widespread use of preprint repositories such as arXiv has accelerated the communication of scientific results but also introduced overlooked security risks. Beyond PDFs, these platforms provide unrestricted access to original source materials, including LaTeX sources, auxiliary code, figures, and embedded comments. In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. In this work, we present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25 state-of-the-art models. Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys. We also uncovered confidential author communications, internal disagreements, and conference submission credentials, exposing information that poses serious reputational risks to both researchers and institutions. We urge the research community and repository operators to take immediate action to close these hidden security gaps. To support open science, we release all scripts and methods from this study but withhold sensitive findings that could be misused, in line with ethical principles. The source code and related material are available at the project website https://github.com/LaTeXpOsEd
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents LaTeXpOsEd, a four-stage framework that combines pattern matching, logical filtering, traditional harvesting, and LLMs to audit 1.2 TB of LaTeX source data from 100,000 arXiv submissions for sensitive disclosures. It introduces the LLMSec-DB benchmark to evaluate 25 LLMs on secret detection and reports uncovering thousands of PII leaks, GPS-tagged files, exposed cloud links and credentials, GitHub/Google keys, and confidential author communications, urging repository operators and researchers to address these risks while releasing analysis scripts.
Significance. If the detection pipeline is reliable, the work provides the first large-scale empirical evidence of overlooked information leakage in public preprint archives, highlighting concrete reputational and security risks. The scale of the 1.2 TB dataset, introduction of LLMSec-DB, and release of scripts are strengths that support reproducibility and future research in the area.
major comments (2)
- [§4] §4 (LLM-based detection stage): The evaluation on LLMSec-DB reports model performance but does not include a mapping or validation of those scores to precision/recall on the actual arXiv corpus, where LaTeX comments, auxiliary files, and sparse context differ from the benchmark. This directly affects whether the aggregate counts of thousands of leaks can be trusted without significant overcounting.
- [§5] §5 (Findings and leak enumeration): The reported scale of exposures (PII, credentials, editable SharePoint links, etc.) is presented as aggregate totals without per-category false-positive rates or human-validated samples drawn from the 100k submissions. Without this, misclassifications such as code variables as API keys remain a load-bearing uncertainty for the central claim.
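The "code variable versus API key" concern can be made concrete with a small filter. One common heuristic, offered here only as an assumption about what a logical filter could look like, is to require a minimum length, mixed character classes, and a minimum Shannon entropy. The names `shannon_entropy` and `looks_like_secret` and both thresholds are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    n = len(text)
    counts = Counter(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_like_secret(token, min_length=20, min_entropy=3.5):
    """Heuristic filter: long, mixed letters and digits, and high entropy.

    The thresholds are illustrative assumptions and would need tuning
    against human-labelled samples before being trusted.
    """
    if len(token) < min_length:
        return False
    has_digit = any(ch.isdigit() for ch in token)
    has_alpha = any(ch.isalpha() for ch in token)
    return has_digit and has_alpha and shannon_entropy(token) >= min_entropy


if __name__ == "__main__":
    for candidate in ["learning_rate_schedule", "api_key", "AKIAIOSFODNN7EXAMPLE"]:
        print(candidate, looks_like_secret(candidate))
```

A filter like this trades recall for precision: it would discard low-entropy passwords, which is one reason per-category false-positive and false-negative rates matter for the central claim.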
minor comments (1)
- [Abstract] The abstract could include a brief note on the categories or approximate breakdown of the detected leaks to give readers immediate context on the findings.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The concerns about validation of the LLM stage and per-category error rates are important for establishing the reliability of the reported findings. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
- Referee: [§4] §4 (LLM-based detection stage): The evaluation on LLMSec-DB reports model performance but does not include a mapping or validation of those scores to precision/recall on the actual arXiv corpus, where LaTeX comments, auxiliary files, and sparse context differ from the benchmark. This directly affects whether the aggregate counts of thousands of leaks can be trusted without significant overcounting.
Authors: We agree that the controlled nature of LLMSec-DB does not automatically guarantee identical performance on the heterogeneous arXiv source files. The pipeline applies pattern matching and logical filtering both before and after the LLM stage precisely to mitigate context differences and reduce overcounting. Nevertheless, we did not report a direct precision/recall mapping from benchmark scores to the 100k-submission corpus. In the revision we will add a dedicated validation subsection that describes a manual audit of a stratified random sample of 500 LLM-flagged items drawn from the actual arXiv data, together with category-specific precision estimates and a discussion of how LaTeX comments and auxiliary files were handled.
Revision: yes
- Referee: [§5] §5 (Findings and leak enumeration): The reported scale of exposures (PII, credentials, editable SharePoint links, etc.) is presented as aggregate totals without per-category false-positive rates or human-validated samples drawn from the 100k submissions. Without this, misclassifications such as code variables as API keys remain a load-bearing uncertainty for the central claim.
Authors: The referee correctly identifies that aggregate counts alone leave open the possibility of systematic misclassification. While the multi-stage design (regex + logical filters + LLM) was intended to suppress false positives such as variable names being mistaken for keys, we did not quantify per-category false-positive rates or present human-validated samples in the submitted version. We will revise §5 to include (i) the results of human review on random samples for each major exposure category and (ii) explicit false-positive rate estimates together with the decision rules used to distinguish code artifacts from genuine credentials.
Revision: yes
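The stratified manual audit promised in the rebuttal could be drawn along the following lines. The rebuttal does not say how the sample is allocated across categories, so proportional allocation with largest-remainder rounding is an assumption here, and `stratified_sample` and the category labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(flagged, total, seed=0):
    """Draw a proportionally allocated random sample per category.

    flagged: list of (category, item_id) pairs produced by the detector
    total:   overall number of items to hand to human reviewers
    Returns a dict mapping category -> list of sampled item ids.
    """
    if total > len(flagged):
        raise ValueError("total exceeds the number of flagged items")
    by_category = defaultdict(list)
    for category, item_id in flagged:
        by_category[category].append(item_id)

    # Proportional quotas, rounded with the largest-remainder method
    # so that the allocations sum exactly to `total`.
    quotas = {c: total * len(ids) / len(flagged) for c, ids in by_category.items()}
    allocation = {c: int(q) for c, q in quotas.items()}
    remaining = total - sum(allocation.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - allocation[c], reverse=True)
    for category in by_remainder[:remaining]:
        allocation[category] += 1

    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    return {c: rng.sample(ids, allocation[c]) for c, ids in by_category.items()}


if __name__ == "__main__":
    # Hypothetical detector output: 1,000 flags across three categories.
    flags = (
        [("pii", i) for i in range(600)]
        + [("credential", i) for i in range(300)]
        + [("private_link", i) for i in range(100)]
    )
    sample = stratified_sample(flags, total=50)
    print({category: len(ids) for category, ids in sample.items()})
```

Proportional allocation gives rare categories very few reviewed items; a fixed minimum per category is a reasonable alternative when category-specific precision is the goal.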
Circularity Check
No circularity: direct empirical measurement of public arXiv sources
The paper conducts a large-scale empirical audit by scanning 100,000 public arXiv submissions (1.2 TB of LaTeX sources, comments, and auxiliary files) with pattern matching, logical filters, and LLMs to count disclosures. No mathematical derivation chain, equations, or first-principles results exist that reduce to fitted parameters or self-definitions. LLMSec-DB is presented as a separate benchmark for evaluating 25 models on curated test cases; the main counts are produced by applying the four-stage framework directly to the arXiv corpus rather than deriving them from benchmark scores. The analysis is therefore self-contained against external public data with no load-bearing self-citation, ansatz smuggling, or renaming of known results as novel derivations.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Hidden Secrets in the arXiv: Discovering, Analyzing, and Preventing Unintentional Information Disclosure in Source Files of Scientific Preprints
Nearly every arXiv submission leaks hidden sensitive information through its source files, existing cleaners fail, and ALC-NG provides a more reliable fix.