arxiv: 2510.03761 · v2 · submitted 2025-10-04 · 💻 cs.CR · cs.AI

You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models

Pith reviewed 2026-05-18 10:28 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords information leakage · preprint archives · arXiv · LaTeX sources · PII disclosure · LLM detection · security audit · open science risks

The pith

arXiv submissions expose thousands of personal details, credentials, and private links through unsanitized source files.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that preprint platforms like arXiv release original LaTeX sources, code, and comments alongside PDFs, creating hidden channels for sensitive data to leak. A four-stage analysis of 100,000 submissions totaling 1.2 TB found widespread disclosures including personal information, GPS-tagged images, editable cloud links, GitHub and Google credentials, and API keys. These leaks also include internal author messages and conference submission details that carry reputational risks. If the findings hold, open-science practices require better sanitization to prevent adversaries from harvesting such material at scale.

Core claim

By processing more than 1.2 TB of source data from 100,000 arXiv submissions, the authors applied LaTeXpOsEd to detect thousands of leaks such as PII, GPS-tagged EXIF files, public Google Drive and Dropbox folders, editable SharePoint links, exposed GitHub and Google credentials, cloud API keys, confidential author communications, internal disagreements, and conference submission credentials.

What carries the argument

LaTeXpOsEd, a four-stage framework that combines pattern matching, logical filtering, traditional harvesting techniques, and large language models to identify hidden disclosures in non-referenced files and LaTeX comments.
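A minimal Python sketch of how pattern matching and logical filtering over LaTeX comments might look. The regular expressions, the `is_placeholder` rule, and every function name here are assumptions made for illustration. They are not taken from the paper's released scripts, and the harvesting and LLM stages are left out.

```python
import re

# Illustrative patterns only (assumption): a real rule set would be far larger.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_\-]{35}"),
}

# A '%' starts a comment unless it is escaped by an odd number of backslashes.
COMMENT_RE = re.compile(r"(?<!\\)(?:\\\\)*%(.*)$")


def extract_comments(tex):
    """Return the non-empty comment text of each line in a LaTeX source string."""
    comments = []
    for line in tex.splitlines():
        match = COMMENT_RE.search(line)
        if match and match.group(1).strip():
            comments.append(match.group(1).strip())
    return comments


def is_placeholder(value):
    """Hypothetical logical filter: drop obvious dummy values."""
    lowered = value.lower()
    return "example" in lowered or "xxxx" in lowered


def scan_comments(tex):
    """Pattern-match every comment and keep hits that survive the filter."""
    findings = []
    for comment in extract_comments(tex):
        for label, pattern in PATTERNS.items():
            for hit in pattern.findall(comment):
                if not is_placeholder(hit):
                    findings.append((label, hit))
    return findings
```

Calling `scan_comments` on the text of a `.tex` file returns `(label, match)` pairs. The sketch ignores `verbatim` and `comment` environments, where a `%` does not start a comment.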

If this is right

  • Repository operators must add automated sanitization steps for source files before public release.
  • Authors should routinely scan their LaTeX submissions for embedded comments and auxiliary files containing credentials or private links.
  • Conference systems and cloud services used by researchers become higher-value targets once preprint leaks are known to exist.
  • Existing open-access policies may need explicit privacy review processes for non-PDF materials.
  • Similar audits could be repeated on other preprint servers to map the full scope of exposure.
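As a rough illustration of the author-side scan suggested above, the following Python sketch lists files in a submission directory that no `.tex` file appears to reference. The set of commands it inspects, the treatment of any file containing `\documentclass` as a root document, and the function name `unreferenced_files` are all assumptions, not part of the paper's tooling.

```python
import re
from pathlib import Path

# Commands whose brace argument names another file (illustrative subset).
REF_RE = re.compile(
    r"\\(?:includegraphics|input|include|bibliography)(?:\[[^\]]*\])?\{([^}]+)\}"
)


def unreferenced_files(root):
    """Return relative paths of files that no .tex file in `root` references."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    texts = {
        p: p.read_text(encoding="utf-8", errors="ignore")
        for p in files
        if p.suffix == ".tex"
    }

    # Collect referenced names without extension, since LaTeX often omits it.
    stems = set()
    for text in texts.values():
        for arg in REF_RE.findall(text):
            for name in arg.split(","):
                if name.strip():
                    stems.add(Path(name.strip()).with_suffix("").as_posix())

    leftovers = []
    for p in files:
        rel = p.relative_to(root)
        is_root_doc = p in texts and "\\documentclass" in texts[p]
        if not is_root_doc and rel.with_suffix("").as_posix() not in stems:
            leftovers.append(rel.as_posix())
    return sorted(leftovers)
```

The sketch has known blind spots: references inside commented-out lines still count as references, `\graphicspath` is not resolved, and file names containing extra dots may be mismatched. Anything it reports is a candidate for manual review, not a confirmed leak.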

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same detection approach could be adapted to scan institutional repositories or journal supplementary material for comparable leaks.
  • Widespread awareness of these risks might shift author behavior toward stricter version control and comment removal before upload.
  • Repository policies could evolve to treat source files as potentially sensitive rather than automatically public.
  • If detection accuracy improves, platforms might offer optional pre-submission privacy scans as a service.

Load-bearing premise

The combination of pattern matching and LLM detection reliably flags sensitive information with low false positives across the varied set of 100,000 submissions.

What would settle it

Independent manual verification of a random sample of the reported leaks to check whether they match the claimed categories or contain only false positives.
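One way to turn such a manual check into a number is to report the share of sampled items confirmed as real leaks together with a confidence interval. The Python sketch below uses the Wilson score interval. The choice of interval and the function name are assumptions for illustration; the paper does not prescribe a specific estimator.

```python
import math


def wilson_interval(confirmed, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = confirmed / sample_size
    denom = 1 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(p * (1 - p) / sample_size + z * z / (4 * sample_size**2))
        / denom
    )
    return centre - half_width, centre + half_width
```

Here `confirmed` would be the number of sampled items a human reviewer judged to be genuine, and `sample_size` the number reviewed. A narrow interval well above zero would support the reported counts; an interval close to zero would indicate that false positives dominate.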

Figures

Figures reproduced from arXiv: 2510.03761 by Bertalan Borsos, Norbert Tihanyi, Richard A. Dubniczky, Tamas Bisztray.

Figure 1
Figure 1: The LaTeXpOsEd framework: a four-step process for scraping, parsing, mining, and analyzing documents.
Surrounding text: … varied across disciplines, with notable differences between fields such as computer science and economics. Several studies have applied computational linguistics techniques to extract insights from arXiv content [7], [8]. A comprehensive analysis of quantitative finance papers from 1997 to 2022 [9] emplo…
Figure 2
Figure 2: Ratio of papers without usable comments in LaTeX source files.
Surrounding text: … arXiv servers. As a result, these files provided a valuable starting point for analysis, leading to the identification of nearly 1,200 images containing sensitive metadata. The types of data represented vary significantly. While device information (e.g., the camera used) or software details (such as the exact version of Photoshop) may already …
Figure 3
Figure 3: The ten most commonly occurring domains in URLs extracted from the comments.
Surrounding text: Although these domains are noteworthy, they rarely contain sensitive information. Instead, such data is more commonly found in file-sharing portals or private websites, which in some cases expose access tokens directly within the URL. We identified 206 IP addresses, only eight in private ranges, with ports exposing web servers, dat…
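The kind of tally behind Figure 3, and the split of IP addresses into private and public ranges, can be sketched in Python with the standard library. The regular expressions and function names are assumptions for illustration, and "private" follows Python's `ipaddress.is_private`, which also covers loopback and other reserved ranges and so is broader than the RFC 1918 blocks.

```python
import ipaddress
import re
from collections import Counter
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s{}<>\"']+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def top_domains(comments, n=10):
    """Count hostnames of URLs found in comment strings, most common first."""
    counts = Counter()
    for comment in comments:
        for url in URL_RE.findall(comment):
            try:
                host = urlsplit(url).hostname
            except ValueError:  # malformed URL, e.g. a broken IPv6 literal
                continue
            if host:
                counts[host] += 1
    return counts.most_common(n)


def classify_ips(comments):
    """Split IPv4 literals found in comments into private and public sets."""
    result = {"private": set(), "public": set()}
    for comment in comments:
        for candidate in IPV4_RE.findall(comment):
            try:
                ip = ipaddress.ip_address(candidate)
            except ValueError:  # not a valid address, e.g. 999.1.1.1
                continue
            result["private" if ip.is_private else "public"].add(str(ip))
    return result
```

Dotted version strings such as `1.2.3.4` also parse as IPv4 addresses, so the output of `classify_ips` would still need a filtering step before being counted.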
Original abstract

The widespread use of preprint repositories such as arXiv has accelerated the communication of scientific results but also introduced overlooked security risks. Beyond PDFs, these platforms provide unrestricted access to original source materials, including LaTeX sources, auxiliary code, figures, and embedded comments. In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. In this work, we present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25 state-of-the-art models. Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys. We also uncovered confidential author communications, internal disagreements, and conference submission credentials, exposing information that poses serious reputational risks to both researchers and institutions. We urge the research community and repository operators to take immediate action to close these hidden security gaps. To support open science, we release all scripts and methods from this study but withhold sensitive findings that could be misused, in line with ethical principles. The source code and related material are available at the project website https://github.com/LaTeXpOsEd

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents LaTeXpOsEd, a four-stage framework that combines pattern matching, logical filtering, traditional harvesting, and LLMs to audit 1.2 TB of LaTeX source data from 100,000 arXiv submissions for sensitive disclosures. It introduces the LLMSec-DB benchmark to evaluate 25 LLMs on secret detection and reports uncovering thousands of PII leaks, GPS-tagged files, exposed cloud links and credentials, GitHub/Google keys, and confidential author communications, urging repository operators and researchers to address these risks while releasing analysis scripts.

Significance. If the detection pipeline is reliable, the work provides the first large-scale empirical evidence of overlooked information leakage in public preprint archives, highlighting concrete reputational and security risks. The scale of the 1.2 TB dataset, introduction of LLMSec-DB, and release of scripts are strengths that support reproducibility and future research in the area.

major comments (2)
  1. §4 (LLM-based detection stage): The evaluation on LLMSec-DB reports model performance but does not include a mapping or validation of those scores to precision/recall on the actual arXiv corpus, where LaTeX comments, auxiliary files, and sparse context differ from the benchmark. This directly affects whether the aggregate counts of thousands of leaks can be trusted without significant overcounting.
  2. §5 (Findings and leak enumeration): The reported scale of exposures (PII, credentials, editable SharePoint links, etc.) is presented as aggregate totals without per-category false-positive rates or human-validated samples drawn from the 100k submissions. Without this, misclassifications such as code variables as API keys remain a load-bearing uncertainty for the central claim.
minor comments (1)
  1. [Abstract] The abstract could include a brief note on the categories or approximate breakdown of the detected leaks to give readers immediate context on the findings.
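The misclassification risk raised in major comment 2, where a code variable is mistaken for an API key, can be illustrated with a simple heuristic in Python. The sketch scores a candidate string by length, character mix, and Shannon entropy. The thresholds and the function names are hypothetical, and nothing here is claimed to be the paper's decision rule.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of a string in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(text).values()
    )


def looks_like_credential(token, min_len=20, min_entropy=3.5):
    """Hypothetical rule: long, mixes letters and digits, and high entropy."""
    if len(token) < min_len:
        return False
    has_digit = any(ch.isdigit() for ch in token)
    has_alpha = any(ch.isalpha() for ch in token)
    return has_digit and has_alpha and shannon_entropy(token) >= min_entropy
```

A heuristic like this is easy to fool in both directions: long descriptive identifiers that contain a digit can exceed the entropy threshold, and short or patterned secrets fall below it. That weakness is the reason human-validated samples per category matter for the reported totals.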

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The concerns about validation of the LLM stage and per-category error rates are important for establishing the reliability of the reported findings. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

point-by-point responses
  1. Referee: §4 (LLM-based detection stage): The evaluation on LLMSec-DB reports model performance but does not include a mapping or validation of those scores to precision/recall on the actual arXiv corpus, where LaTeX comments, auxiliary files, and sparse context differ from the benchmark. This directly affects whether the aggregate counts of thousands of leaks can be trusted without significant overcounting.

    Authors: We agree that the controlled nature of LLMSec-DB does not automatically guarantee identical performance on the heterogeneous arXiv source files. The pipeline applies pattern matching and logical filtering both before and after the LLM stage precisely to mitigate context differences and reduce overcounting. Nevertheless, we did not report a direct precision/recall mapping from benchmark scores to the 100k-submission corpus. In the revision we will add a dedicated validation subsection that describes a manual audit of a stratified random sample of 500 LLM-flagged items drawn from the actual arXiv data, together with category-specific precision estimates and a discussion of how LaTeX comments and auxiliary files were handled.

    revision: yes

  2. Referee: §5 (Findings and leak enumeration): The reported scale of exposures (PII, credentials, editable SharePoint links, etc.) is presented as aggregate totals without per-category false-positive rates or human-validated samples drawn from the 100k submissions. Without this, misclassifications such as code variables as API keys remain a load-bearing uncertainty for the central claim.

    Authors: The referee correctly identifies that aggregate counts alone leave open the possibility of systematic misclassification. While the multi-stage design (regex + logical filters + LLM) was intended to suppress false positives such as variable names being mistaken for keys, we did not quantify per-category false-positive rates or present human-validated samples in the submitted version. We will revise §5 to include (i) the results of human review on random samples for each major exposure category and (ii) explicit false-positive rate estimates together with the decision rules used to distinguish code artifacts from genuine credentials.

    revision: yes
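The validation promised in these responses, a stratified random sample followed by category-specific precision, could be organized along the following lines. This Python sketch assumes proportional allocation with at least one item per category and a fixed seed for reproducibility. The function names and the data layout are hypothetical, and the simulated rebuttal does not specify how the sample would actually be drawn.

```python
import random
from collections import defaultdict


def stratified_sample(flagged, total, seed=0):
    """Draw a roughly proportional random sample from each category.

    `flagged` is a list of (category, item_id) pairs. Rounding means the
    combined sample size can differ slightly from `total`.
    """
    by_category = defaultdict(list)
    for category, item_id in flagged:
        by_category[category].append(item_id)

    rng = random.Random(seed)
    sample = {}
    for category in sorted(by_category):
        pool = by_category[category]
        share = round(total * len(pool) / len(flagged))
        size = min(len(pool), max(1, share))
        sample[category] = rng.sample(pool, size)
    return sample


def precision_by_category(review):
    """`review` maps category -> list of booleans (True = confirmed leak)."""
    return {
        category: sum(verdicts) / len(verdicts)
        for category, verdicts in review.items()
        if verdicts
    }
```

One minus the per-category precision gives the false-positive share among flagged items in that category, which is the quantity the referee asks for.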

Circularity Check

0 steps flagged

No circularity: direct empirical measurement of public arXiv sources

full rationale

The paper conducts a large-scale empirical audit by scanning 100,000 public arXiv submissions (1.2 TB of LaTeX sources, comments, and auxiliary files) with pattern matching, logical filters, and LLMs to count disclosures. No mathematical derivation chain, equations, or first-principles results exist that reduce to fitted parameters or self-definitions. LLMSec-DB is presented as a separate benchmark for evaluating 25 models on curated test cases; the main counts are produced by applying the four-stage framework directly to the arXiv corpus rather than deriving them from benchmark scores. The analysis is therefore self-contained against external public data with no load-bearing self-citation, ansatz smuggling, or renaming of known results as novel derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security measurement study. No mathematical free parameters, domain axioms, or invented entities are required for the central claims.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hidden Secrets in the arXiv: Discovering, Analyzing, and Preventing Unintentional Information Disclosure in Source Files of Scientific Preprints

    cs.CR 2026-04 unverdicted novelty 7.0

    Nearly every arXiv submission leaks hidden sensitive information through its source files, existing cleaners fail, and ALC-NG provides a more reliable fix.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper

  1. [1]

    Publication output by country, region, or economy and scientific field,

    National Science Board, “Publication output by country, region, or economy and scientific field,” 2021, accessed: 2025-09-25. [Online]. Available: https://ncses.nsf.gov/pubs/nsb20214/publication-output-by-country-region-or-economy-and-scientific-field

  2. [2]

    The not yet exploited goldmine of osint: Opportunities, open challenges and future trends,

    J. Pastor-Galindo, P. Nespoli, F. Gómez Mármol, and G. Martínez Pérez, “The not yet exploited goldmine of osint: Opportunities, open challenges and future trends,” IEEE Access, vol. 8, pp. 10282–10304, 2020

  3. [3]

    Reaper: an automated, scalable solution for mass credential harvesting and osint,

    B. Butler, B. Wardman, and N. Pratt, “Reaper: an automated, scalable solution for mass credential harvesting and osint,” in 2016 APWG Symposium on Electronic Crime Research (eCrime), 2016, pp. 1–10

  4. [4]

    Hidden division of labor in scientific teams revealed through 1.6 million latex files,

    J. Pei, L. Yang, and L. Wu, “Hidden division of labor in scientific teams revealed through 1.6 million latex files,” arXiv preprint arXiv:2502.07263, 2025

  5. [5]

    unarxive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata,

    T. Saier and M. Färber, “unarxive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata,” Scientometrics, vol. 125, no. 3, pp. 3085–3108, 2020

  6. [6]

    Modular versus hierarchical: A structural signature of topic popularity in mathematical research,

    B. Hepler, “Modular versus hierarchical: A structural signature of topic popularity in mathematical research,” 2025. [Online]. Available: https://arxiv.org/abs/2506.22946

  7. [7]

    Classification and clustering of arxiv documents, sections, and abstracts, comparing encodings of natural and mathematical language,

    P. Scharpf, M. Schubotz, A. Youssef, F. Hamborg, N. Meuschke, and B. Gipp, “Classification and clustering of arxiv documents, sections, and abstracts, comparing encodings of natural and mathematical language,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 2020, pp. 137–146

  8. [8]

    Anthroscore: A computational linguistic measure of anthropomorphism,

    M. Cheng, K. Gligoric, T. Piccardi, and D. Jurafsky, “Anthroscore: A computational linguistic measure of anthropomorphism,” arXiv preprint arXiv:2402.02056, 2024

  9. [9]

    Text mining arxiv: a look through quantitative finance papers,

    M. L. Bianchi, “Text mining arxiv: a look through quantitative finance papers,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01751

  10. [10]

    Ai academic research aggregator,

    A. T. A, C. Kumar, A. Amal, and A. Kumar, “Ai academic research aggregator,” in 2025 3rd International Conference on Smart Systems for applications in Electrical Sciences (ICSSES), 2025, pp. 1–5

  11. [11]

    A benchmark of pdf information extraction tools using a multi-task and multi-domain evaluation framework for academic documents,

    N. Meuschke, A. Jagdale, T. Spinde, J. Mitrović, and B. Gipp, “A benchmark of pdf information extraction tools using a multi-task and multi-domain evaluation framework for academic documents,” in Information for a Better World: Normality, Virtuality, Physicality, Inclusivity, I. Sserwanga, A. Goulding, H. Moulaison-Sandy, J. T. Du, A. L. Soares, V ....

  12. [12]

    Dynamic intelligence assessment: Benchmarking llms on the road to agi with a focus on model confidence,

    N. Tihanyi, T. Bisztray, R. A. Dubniczky, R. Toth, B. Borsos, B. Cherif, R. Jain, L. Muzsai, M. A. Ferrag, R. Marinelli et al., “Dynamic intelligence assessment: Benchmarking llms on the road to agi with a focus on model confidence,” in 2024 IEEE International Conference on Big Data (BigData). IEEE, 2024, pp. 3313–3321

  13. [13]

    Castle: Benchmarking dataset for static code analyzers and llms towards cwe detection,

    R. A. Dubniczky, K. Z. Horvát, T. Bisztray, M. A. Ferrag, L. C. Cordeiro, and N. Tihanyi, “Castle: Benchmarking dataset for static code analyzers and llms towards cwe detection,” in International Symposium on Theoretical Aspects of Software Engineering. Springer, 2025, pp. 253–272

  14. [14]

    Secret breach detection in source code with large language models,

    M. N. Rahman, S. Ahmed, Z. Wahab, S. M. Sohan, and R. Shahriyar, “Secret breach detection in source code with large language models,”

  15. [15]

    Available: https://arxiv.org/abs/2504.18784

    [Online]. Available: https://arxiv.org/abs/2504.18784

  16. [16]

    Metadata practices for consumer photos,

    J. Tesic, “Metadata practices for consumer photos,” IEEE MultiMedia, vol. 12, no. 3, pp. 86–92, 2005

  17. [17]

    Recent advances in named entity recognition: A comprehensive survey and comparative study,

    I. Keraghel, S. Morbieu, and M. Nadif, “Recent advances in named entity recognition: A comprehensive survey and comparative study,”

  18. [18]

    Seongyun Lee, Hyunjae Kim, and Jaewoo Kang

    [Online]. Available: https://arxiv.org/abs/2401.10825

  19. [19]

    arxiv hits 12k in may,

    J. Entwood, “arxiv hits 12k in may,” arXiv Blog, May 2018, accessed 2025-09-17. [Online]. Available: https://blog.arxiv.org/2018/05/31/arxiv-hits-12k-in-may/

  20. [20]

    Secrets patterns db: The largest open-source database for detecting secrets, api keys, passwords, tokens, and more,

    M. Ahmed, “Secrets patterns db: The largest open-source database for detecting secrets, api keys, passwords, tokens, and more,” GitHub repository, 2023, CC-BY-SA-4.0 license. [Online]. Available: https://github.com/mazen160/secrets-patterns-db