pith. machine review for the scientific record.

arxiv: 2604.20927 · v1 · submitted 2026-04-22 · 💻 cs.CR

Recognition: unknown

Hidden Secrets in the arXiv: Discovering, Analyzing, and Preventing Unintentional Information Disclosure in Source Files of Scientific Preprints

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:39 UTC · model grok-4.3

classification 💻 cs.CR
keywords arXiv preprints · source file disclosure · LaTeX metadata · information leakage · unintentional leaks · Git history exposure · API key leakage

The pith

Nearly every arXiv submission with available source files unintentionally discloses some form of "hidden" information, in notable cases including API keys, Git histories, and links to internal documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines source files from all 2.7 million arXiv submissions that include them, checking three categories of potential leaks: extra files that are not needed for compilation, metadata hidden inside files, and comments or other irrelevant content in the code. It finds that some form of such disclosure appears in almost every submission, including links to editable internal documents, API keys, private keys, and complete Git histories. Existing tools that claim to clean these files do not reliably remove the unwanted content. The authors therefore introduce ALC-NG as a tool that more comprehensively strips out unnecessary files, metadata, and comments from LaTeX sources.

Core claim

A systematic scan across three dimensions of source-file content shows that unintentional information disclosure is ubiquitous in arXiv preprints, and that current sanitization utilities fail to eliminate it.

What carries the argument

A three-dimensional scan of unnecessary files, embedded metadata, and irrelevant content such as comments, applied to the full set of 2.7 million arXiv source submissions.

If this is right

  • Researchers must inspect their source bundles before arXiv upload if they wish to avoid exposing internal coordination links, keys, or repository histories.
  • Existing cleaning utilities cannot be trusted to produce safe source archives.
  • ALC-NG provides a practical method for removing files, metadata, and comments that are not required to compile the paper.
  • The public availability of LaTeX sources on arXiv creates a persistent leakage channel that authors and platforms must actively manage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Preprint platforms that publish sources could integrate automated cleaning at upload time to reduce the risk without limiting openness.
  • The same leakage pattern likely appears on other servers that distribute author-provided source archives.
  • Authors who maintain clean, dedicated submission repositories rather than full development histories would avoid many of the detected disclosures.

Load-bearing premise

The items flagged as sensitive are genuine unintentional disclosures rather than items the authors deliberately chose to share, and the automated detection rules produce few false positives.

What would settle it

A manual review of a random sample of flagged items that finds most of them are either false positives or were intentionally left in the source.
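Settling it is mostly arithmetic: label a random sample of flagged items by hand and report precision with an uncertainty interval. A hedged sketch, where the sample size and the Wilson 95% interval are illustrative choices, not the paper's protocol:

```python
# Precision of a detector from a manually labeled random sample of its
# detections, with a Wilson score confidence interval.
import math

def precision_with_wilson_ci(true_positives: int, sample_size: int,
                             z: float = 1.96):
    """Point estimate plus a Wilson score interval (default 95%)."""
    p = true_positives / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)
    )
    return p, center - half, center + half
```

For example, if 470 of 500 manually reviewed flags were genuine disclosures, the estimated precision is 0.94 with an interval of roughly 0.92 to 0.96; an interval that wide or narrow is what would make "nearly every" defensible or not.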

Figures

Figures reproduced from arXiv: 2604.20927 by David Schütte, Jan Pennekamp, Johannes Lohmöller, Joscha Loos, Martin Henze.

Figure 1
Figure 1: LaTeX sources might contain comments or dangling files that are irrelevant to or unused in the compiled PDF, respectively. When being distributed regardless, e.g., via arXiv, they disclose (sensitive) information.
Figure 2
Figure 2: Annual volume of arXiv submissions with and without sources.
Figure 3
Figure 3: Complementary cumulative distribution functions (1-CDFs) of …
Figure 4
Figure 4: Development of the prevalence of "hidden" information over time shows an upward trend (left) as well as across aggregated arXiv categories and …
Figure 5
Figure 5: Subsequent versions rather add unique data over the first version.
Figure 7
Figure 7: File modification timestamps allow for deriving author behavior.
Figure 8
Figure 8: Occurrence of URLs in our dataset grouped by protocol (left) and …
Figure 9
Figure 9: Distribution of over- and underremoved dangling files when …
Figure 10
Figure 10: The share of submissions that feature "hidden" information is …
Figure 11
Figure 11: Longitudinal analysis of the categories of "hidden" information …
read the original abstract

Preprints are essential for the timely and open dissemination of research. arXiv, the most widely used preprint service, takes the idea of open science one step further by not only publishing the actual preprints but also LaTeX sources and other files used to create them. As known from other contexts, such as GitHub repositories, and anecdotally exemplified for arXiv, making source code publicly available risks disclosing otherwise "hidden" information. Consequently, the public availability of paper sources raises the question of how much sensitive content is (unintentionally) disclosed through them. In this paper, we systematically answer this question for all 2.7M arXiv submissions with available source files across three dimensions of source file-induced information disclosure: (1) inclusion of unnecessary files, (2) metadata embedded in files, and (3) irrelevant content in files such as source code comments. Our analysis reveals that nearly every arXiv submission contains some form of "hidden" information. Notable findings range from links to editable web documents for internal coordination over API and private keys to complete Git histories. While different tools promise to remove such information from source files, we show that they fail to reliably achieve the intended cleaning functionality. To mitigate this situation, we provide ALC-NG to comprehensively remove files, metadata, and comments that are not needed to compile a LaTeX paper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale empirical analysis of source files from 2.7M arXiv submissions, examining unintentional information disclosure along three dimensions: inclusion of unnecessary files, embedded metadata, and irrelevant content such as comments. It concludes that nearly every submission contains some form of hidden information, with examples including links to editable documents, API/private keys, and full Git histories. The authors further show that existing cleaning tools are unreliable and introduce ALC-NG as a comprehensive solution for removing non-essential elements from LaTeX sources.

Significance. If the detection accuracy holds, the work is significant for highlighting privacy risks in open-science practices at unprecedented scale. The direct analysis of public arXiv data and the release of ALC-NG represent practical contributions that could inform submission guidelines and tool development in the field.

major comments (2)
  1. [Section 3 (Detection Methodology)] The central prevalence claim ('nearly every' submission contains hidden information) rests on automated detection across 2.7M files, yet no precision evaluation, false-positive rates, or manual validation on a labeled sample is reported for the heuristics identifying API keys, private keys, Git histories, or internal coordination links. This directly undermines the measurement and the subsequent claim that cleaning tools fail.
  2. [Section 5 (Tool Evaluation)] The evaluation that existing tools 'fail to reliably achieve the intended cleaning functionality' lacks quantitative metrics (e.g., before/after precision-recall on a held-out set of files containing the identified sensitive patterns) or details on how ALC-NG was benchmarked against them.
minor comments (2)
  1. [Abstract] The abstract states high-level findings without quantifying the exact fraction of submissions affected or providing confidence intervals for the 'nearly every' statistic.
  2. [Section 3] Detection patterns and file-type classifications would benefit from an explicit table or appendix listing the regular expressions or rules used, to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional validation and quantitative evaluation as outlined.

read point-by-point responses
  1. Referee: [Section 3 (Detection Methodology)] The central prevalence claim ('nearly every' submission contains hidden information) rests on automated detection across 2.7M files, yet no precision evaluation, false-positive rates, or manual validation on a labeled sample is reported for the heuristics identifying API keys, private keys, Git histories, or internal coordination links. This directly undermines the measurement and the subsequent claim that cleaning tools fail.

    Authors: We acknowledge that the original manuscript does not report precision, false-positive rates, or manual validation for the detection heuristics. The heuristics were implemented conservatively using established patterns (e.g., regex for keys and directory checks for Git histories) to reduce false positives, and the large-scale results are supported by numerous concrete examples of disclosures. However, we agree that explicit validation metrics would strengthen the prevalence claim and the argument about cleaning tools. In the revision we will add a dedicated validation subsection to Section 3, including manual review of a random sample of 500 detections per category with reported precision and false-positive rates. revision: yes

  2. Referee: [Section 5 (Tool Evaluation)] The evaluation that existing tools 'fail to reliably achieve the intended cleaning functionality' lacks quantitative metrics (e.g., before/after precision-recall on a held-out set of files containing the identified sensitive patterns) or details on how ALC-NG was benchmarked against them.

    Authors: The current Section 5 evaluation demonstrates tool failures through systematic analysis and case studies on real arXiv submissions. We agree that the absence of quantitative before/after metrics limits the strength of the comparison. In the revised manuscript we will add a quantitative benchmark subsection that uses a held-out test set of files containing the identified sensitive patterns, reporting precision, recall, and F1 scores for existing tools versus ALC-NG, along with full details of the benchmarking procedure. revision: yes
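The benchmark promised above can be sketched as set arithmetic: treat a cleaning tool's removals as predictions against the set of sensitive items known to be planted in the held-out files. Set contents here are hypothetical; over-removal of needed content counts as a false positive, a leftover secret as a false negative.

```python
# Score a cleaning tool: `sensitive` is the ground-truth set of planted
# sensitive items; `removed` is the set of items the tool actually removed.
def cleaning_scores(sensitive: set, removed: set):
    tp = len(sensitive & removed)   # sensitive items correctly removed
    fp = len(removed - sensitive)   # needed content removed by mistake
    fn = len(sensitive - removed)   # sensitive items left in the bundle
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Reporting these three numbers per tool on the same held-out set would make the "fail to reliably clean" claim directly comparable across tools.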

Circularity Check

0 steps flagged

Empirical measurement study with no derivations or self-referential claims

full rationale

This paper performs a direct empirical scan of 2.7M public arXiv source files across three dimensions of disclosure (unnecessary files, metadata, irrelevant content). No equations, fitted parameters, predictions, or uniqueness theorems appear; claims rest on observable counts from public data rather than any construction that reduces to the paper's own inputs or prior self-citations. The measurement pipeline is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on domain assumptions about what counts as 'unnecessary' or 'sensitive' content rather than on mathematical axioms or fitted parameters.

axioms (1)
  • domain assumption Certain content such as source comments, metadata, and non-compilation files constitutes unintentional disclosure when made public.
    This definition underpins the three dimensions of analysis and the design of ALC-NG.

pith-pipeline@v0.9.0 · 5573 in / 1246 out tokens · 38810 ms · 2026-05-10T00:39:08.277131+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

128 extracted references · 17 canonical work pages · 2 internal anchors

  1. [1]

    arXiv.org: the Los Alamos National Laboratory e-print server,

    G. McKiernan, “arXiv.org: the Los Alamos National Laboratory e-print server,” IJGL, vol. 1, no. 3, 2000

  2. [2]

    Let’s discuss: is arXiv always good?

    citeordie, “Let’s discuss: is arXiv always good?” https://archive.is/ZQ5u2, 2015

  3. [3]

    arXiv.org e-Print archive,

    Cornell Tech, “arXiv.org e-Print archive,” https://arxiv.org/, 1991

  4. [4]

    Is preprint the future of science? A thirty year journey of online preprint services,

    B. Xie et al., “Is preprint the future of science? A thirty year journey of online preprint services,” arXiv:2102.09066, 2021

  5. [5]

    Monthly Submissions,

    arXiv, “Monthly Submissions,” https://arxiv.org/stats/monthly_submissions, 2014

  6. [6]

    Requiring TeX when possible,

    arXiv, “Requiring TeX when possible,” https://arxiv.org/help/faq/whytex.html, 2005

  7. [7]

    Overview of the 2003 KDD Cup,

    J. Gehrke et al., “Overview of the 2003 KDD Cup,” SIGKDD Explor., vol. 5, no. 2, 2003

  8. [8]

    Inconsistencies in TeX-Produced Documents,

    J. Tan and M. Rigger, “Inconsistencies in TeX-Produced Documents,” in ISSTA, 2024

  9. [9]

    Challenges in End-to-End Neural Scientific Table Recognition,

    Y. Deng et al., “Challenges in End-to-End Neural Scientific Table Recognition,” in ICDAR, 2019

  10. [10]

    LATTE: Improving Latex Recognition for Tables and Formulae with Iterative Refinement,

    N. Jiang et al., “LATTE: Improving Latex Recognition for Tables and Formulae with Iterative Refinement,” AAAI, vol. 39, no. 4, 2025

  11. [11]

    A framework for improving the accessibility of research papers on arXiv.org,

    S. Brinn et al., “A framework for improving the accessibility of research papers on arXiv.org,” arXiv:2212.07286, 2022

  12. [12]

    Information Leakage Caused by Hidden Data in Published Documents,

    S. Byers, “Information Leakage Caused by Hidden Data in Published Documents,” SP, vol. 2, no. 2, 2004

  13. [13]

    Scanning electronic documents for personally identifiable information,

    T. Aura et al., “Scanning electronic documents for personally identifiable information,” in WPES, 2006

  14. [14]

    Taking advantages of a disadvantage: Digital forensics and steganography using document metadata,

    A. Castiglione et al., “Taking advantages of a disadvantage: Digital forensics and steganography using document metadata,” JSS, vol. 80, no. 5, 2007

  15. [15]

    Disclosing Private Information from Metadata, hidden info and lost data,

    C. Alonso et al., “Disclosing Private Information from Metadata, hidden info and lost data,” Black Hat Europe 2009, 2009

  16. [16]

    Leaking Sensitive Information in Complex Document Files–and How to Prevent It,

    S. L. Garfinkel, “Leaking Sensitive Information in Complex Document Files–and How to Prevent It,” SP, vol. 12, no. 1, 2013

  17. [17]

    A Systematic Method on PDF Privacy Leakage Issues,

    Y. Feng et al., “A Systematic Method on PDF Privacy Leakage Issues,” in TrustCom, 2018

  18. [18]

    Exploitation and Sanitization of Hidden Data in PDF Files: Do Security Agencies Sanitize Their PDF Files?

    S. Adhatarao and C. Lauradoux, “Exploitation and Sanitization of Hidden Data in PDF Files: Do Security Agencies Sanitize Their PDF Files?” in IH&MMSec, 2021

  19. [19]

    Overheard on Quant-Ph,

    QuantPhComments, “Overheard on Quant-Ph,” https://archive.is/RyI9q, 2018

  20. [20]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4 (Uncommented Version),

    DV2559106965076, “Sparks of Artificial General Intelligence: Early experiments with GPT-4 (Uncommented Version),” https://archive.is/1icMv, 2023

  21. [21]

    arxiv_latex_cleaner,

    Google Research, “arxiv_latex_cleaner,” https://github.com/google-research/arxiv-latex-cleaner, 2019

  22. [22]

    arXiv LaTeX cleaner: safer and easier open source research papers,

    J. Pont-Tuset, “arXiv LaTeX cleaner: safer and easier open source research papers,” https://archive.is/0Fmqz, 2019

  23. [23]

    Scienceography: The Study of How Science Is Written,

    G. Cormode et al., “Scienceography: The Study of How Science Is Written,” in LNCS, 2012

  24. [24]

    Studying the source code of scientific research,

    G. Cormode et al., “Studying the source code of scientific research,” SIGKDD Explor., vol. 14, no. 2, 2013

  25. [25]

    Artifact: Hidden Secrets in the arXiv,

    J. Pennekamp et al., “Artifact: Hidden Secrets in the arXiv,” https://zenodo.org/record/19366799, 2026, see also: https://arxiv.comsys.rwth-aachen.de

  26. [26]

    ALC-NG,

    J. Pennekamp et al., “ALC-NG,” https://github.com/COMSYS/ALC-NG, 2026, see also: http://alc-ng.de

  27. [27]

    A Guide to Posting and Managing Preprints,

    H. Moshontz et al., “A Guide to Posting and Managing Preprints,” AMPPS, vol. 4, no. 2, 2021

  28. [28]

    Directory of Open Access Preprint Repositories,

    COAR and CCSD, “Directory of Open Access Preprint Repositories,” https://doapr.coar-repositories.org/repositories/, 2022

  29. [29]

    arXiv Annual Report 2023,

    arXiv, “arXiv Annual Report 2023,” Cornell Tech, Tech. Rep., 2024

  30. [30]

    Information seeking behavior of scientists in the electronic information age: Astronomers, chemists, mathematicians, and physicists,

    C. M. Brown, “Information seeking behavior of scientists in the electronic information age: Astronomers, chemists, mathematicians, and physicists,” JASIST, vol. 50, no. 10, 1999

  31. [31]

    The Place of E-Prints in the Publication Patterns of Physical Scientists,

    K. Manuel, “The Place of E-Prints in the Publication Patterns of Physical Scientists,” Sci. Technol. Libr., vol. 20, no. 1, 2001

  32. [32]

    How many preprints have actually been printed and why: a case study of computer science preprints on arXiv,

    J. Lin et al., “How many preprints have actually been printed and why: a case study of computer science preprints on arXiv,” Scientometrics, vol. 124, no. 1, 2020

  33. [33]

    Monthly Download Rates,

    arXiv, “Monthly Download Rates,” https://arxiv.org/stats/monthly_downloads, 2014

  34. [34]

    ArXiv screens spot fake papers,

    P. Ginsparg, “ArXiv screens spot fake papers,” Nature, vol. 508, no. 7494, 2014

  35. [35]

    WithdrarXiv: A Large-Scale Dataset for Retraction Study,

    D. Rao et al., “WithdrarXiv: A Large-Scale Dataset for Retraction Study,” arXiv:2412.03775, 2024

  36. [36]

    arXiv E-prints and the journal of record: An analysis of roles and relationships,

    V. Larivière et al., “arXiv E-prints and the journal of record: An analysis of roles and relationships,” JASIST, vol. 65, no. 6, 2014

  37. [37]

    The effect of use and access on citations,

    M. J. Kurtz et al., “The effect of use and access on citations,” IP&M, vol. 41, no. 6, 2005

  38. [38]

    Does the arXiv lead to higher citations and reduced publisher downloads for mathematics articles?

    P. M. Davis and M. J. Fromerth, “Does the arXiv lead to higher citations and reduced publisher downloads for mathematics articles?” Scientometrics, vol. 71, no. 2, 2007

  39. [39]

    The effect of “open access” on citation impact: An analysis of ArXiv’s condensed matter section,

    H. F. Moed, “The effect of “open access” on citation impact: An analysis of ArXiv’s condensed matter section,” JASIST, vol. 58, no. 13, 2007

  40. [40]

    Positional effects on citation and readership in arXiv,

    A.-u. Haque and P. Ginsparg, “Positional effects on citation and readership in arXiv,” JASIST, vol. 60, no. 11, 2009

  41. [41]

    Last but not least: Additional positional effects on citation and readership in arXiv,

    A.-u. Haque and P. Ginsparg, “Last but not least: Additional positional effects on citation and readership in arXiv,” JASIST, vol. 61, no. 12, 2010

  42. [42]

    On the Use of ArXiv as a Dataset

    C. B. Clement et al., “On the Use of ArXiv as a Dataset,” arXiv:1905.00075, 2019, Representation Learning on Graphs and Manifolds (RLGM) Workshop 2019

  43. [43]

    unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata,

    T. Saier and M. Färber, “unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata,” Scientometrics, vol. 125, no. 3, 2020

  44. [44]

    Text Mining arXiv: A Look Through Quantitative Finance Papers,

    M. L. Bianchi, “Text Mining arXiv: A Look Through Quantitative Finance Papers,” Mathematics, vol. 13, no. 9, 2025

  45. [45]

    Modular versus Hierarchical: A Structural Signature of Topic Popularity in Mathematical Research,

    B. Hepler, “Modular versus Hierarchical: A Structural Signature of Topic Popularity in Mathematical Research,” arXiv:2506.22946, 2025

  46. [46]

    Hidden Division of Labor in Scientific Teams Revealed Through 1.6 Million LaTeX Files,

    J. Pei et al., “Hidden Division of Labor in Scientific Teams Revealed Through 1.6 Million LaTeX Files,” arXiv:2502.07263, 2025

  47. [47]

    Writing Patterns Reveal a Hidden Division of Labor in Scientific Teams,

    L. Yang et al., “Writing Patterns Reveal a Hidden Division of Labor in Scientific Teams,” arXiv:2504.14093, 2025

  48. [48]

    Plagiarism Detection in arXiv,

    D. Sorokina et al., “Plagiarism Detection in arXiv,” in ICDM, 2006

  49. [49]

    Patterns of text reuse in a scientific corpus,

    D. T. Citron and P. Ginsparg, “Patterns of text reuse in a scientific corpus,” PNAS, vol. 112, no. 1, 2015

  50. [50]

    Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool,

    A. Akram, “Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool,” arXiv:2403.13812, 2024

  51. [51]

    A Literature Review on Text Classification and Sentiment Analysis Approaches,

    W. Dawei et al., “A Literature Review on Text Classification and Sentiment Analysis Approaches,” in LNEE, 2020

  52. [52]

    Scientific Statement Classification over arXiv.org,

    D. Ginev and B. R. Miller, “Scientific Statement Classification over arXiv.org,” in LREC, 2020

  53. [53]

    Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language,

    P. Scharpf et al., “Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language,” in ACM/IEEE JCDL, 2020

  54. [54]

    Reconstructing LaTeX Source Files from Generated PDFs — a Neural Network Approach,

    B. Safnuk and G. Hu, “Reconstructing LaTeX Source Files from Generated PDFs — a Neural Network Approach,” in INDIN, 2018

  55. [55]

    DocBank: A Benchmark Dataset for Document Layout Analysis,

    M. Li et al., “DocBank: A Benchmark Dataset for Document Layout Analysis,” in COLING, 2020

  56. [56]

    A Benchmark of PDF Information Extraction Tools Using a Multi-task and Multi-domain Evaluation Framework for Academic Documents,

    N. Meuschke et al., “A Benchmark of PDF Information Extraction Tools Using a Multi-task and Multi-domain Evaluation Framework for Academic Documents,” in LNCS, 2023

  57. [57]

    Detecting and Mitigating Secret-Key Leaks in Source Code Repositories,

    V. S. Sinha et al., “Detecting and Mitigating Secret-Key Leaks in Source Code Repositories,” in MSR, 2015

  58. [58]

    How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories,

    M. Meli et al., “How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories,” in NDSS, 2019

  59. [59]

    Automated Detection of Password Leakage from Public GitHub Repositories,

    R. Feng et al., “Automated Detection of Password Leakage from Public GitHub Repositories,” in ICSE, 2022

  60. [60]

    SecretBench: A Dataset of Software Secrets,

    S. K. Basak et al., “SecretBench: A Dataset of Software Secrets,” in MSR, 2023

  61. [61]

    A Comparative Study of Software Secrets Reporting by Secret Detection Tools,

    S. K. Basak et al., “A Comparative Study of Software Secrets Reporting by Secret Detection Tools,” in ESEM, 2023

  62. [62]

    Secrets Revealed in Container Images: An Internet-wide Study on Occurrence and Impact,

    M. Dahlmanns et al., “Secrets Revealed in Container Images: An Internet-wide Study on Occurrence and Impact,” in ASIACCS, 2023

  63. [63]

    Leaky Apps: Large-scale Analysis of Secrets Distributed in Android and iOS Apps,

    D. Schmidt et al., “Leaky Apps: Large-scale Analysis of Secrets Distributed in Android and iOS Apps,” in CCS, 2025

  64. [64]

    latexindent.pl,

    C. Hughes, “latexindent.pl,” https://github.com/cmhughes/latexindent.pl, 2012

  65. [65]

    arXiv Cleaner,

    Elsa Lab, “arXiv Cleaner,” https://github.com/elsa-lab/arxiv-cleaner, 2019

  66. [66]

    Submission Sanitizer & Flattener,

    D. Stutz, “Submission Sanitizer & Flattener,” https://github.com/davidstutz/arxiv-submission-sanitizer-flattener, 2022

  67. [67]

    Dirty Metadata: Understanding A Threat to Online Privacy,

    C. Gouert and N. G. Tsoutsos, “Dirty Metadata: Understanding A Threat to Online Privacy,” IEEE Secur. Priv., vol. 20, no. 6, 2022

  68. [68]

    MetaLeak: Assessing Image Metadata Leakage in Android Apps,

    T. T. L. Nguyen et al., “MetaLeak: Assessing Image Metadata Leakage in Android Apps,” in AICCSA, 2024

  69. [69]

    ExifTool,

    P. Harvey, “ExifTool,” https://exiftool.org/, 2003

  70. [70]

    qpdf,

    J. Berkenbilt, “qpdf,” https://github.com/qpdf/qpdf, 2005

  71. [71]

    mat2 – Metadata and privacy,

    J. Voisin, “mat2 – Metadata and privacy,” https://github.com/jvoisin/mat2, 2018

  72. [72]

    Security and privacy issues in the Portable Document Format,

    A. Castiglione et al., “Security and privacy issues in the Portable Document Format,” JSS, vol. 83, no. 10, 2010

  73. [73]

    Malicious PDF Documents Explained,

    D. Stevens, “Malicious PDF Documents Explained,” SP, vol. 9, no. 1, 2011

  74. [74]

    Malicious PDF Detection using Metadata and Structural Features,

    C. Smutz and A. Stavrou, “Malicious PDF Detection using Metadata and Structural Features,” in ACSAC, 2012

  75. [75]

    How are PDF files published in the Scientific Community?

    S. Adhatarao and C. Lauradoux, “How are PDF files published in the Scientific Community?” in WIFS, 2021

  76. [76]

    You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models,

    R. A. Dubniczky et al., “You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models,” arXiv:2510.03761, 2025

  77. [77]

    X-raying the arXiv: A Large-Scale Analysis of arXiv Submissions’ Source Files,

    G. Apruzzese and A. Fass, “X-raying the arXiv: A Large-Scale Analysis of arXiv Submissions’ Source Files,” arXiv:2601.11385, 2026

  78. [78]

    The 00README File Format,

    arXiv, “The 00README File Format,” https://info.arxiv.org/help/00README.html, 2010

  79. [79]

    Are Your Documents Leaking Sensitive Information? Scrub Your Metadata!

    M. Spiegel, “Are Your Documents Leaking Sensitive Information? Scrub Your Metadata!” https://er.educause.edu/blogs/2017/1/are-your-documents-leaking-sensitive-information-scrub-your-metadata, 2017

  80. [80]

    Tree-sitter,

    tree-sitter, “Tree-sitter,” https://tree-sitter.github.io/tree-sitter/, 2018

Showing first 80 references.