TopVenues: A Reproducible Corpus and Tooling Substrate for Cybersecurity Literature Reviews
Pith reviewed 2026-06-27 00:16 UTC · model grok-4.3
The pith
TopVenues turns shifting publisher data into a fixed, versioned SQLite corpus for cybersecurity literature reviews.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TopVenues declares a venue and year scope, uses DBLP as the metadata spine, enriches records with abstracts and BibTeX via open APIs and publisher extractors, and stores the results in a monotonic SQLite snapshot that functions as an executable, inspectable, and citable corpus. The approach produces 99.86 percent abstract coverage and 99.99 percent BibTeX coverage on 9,925 papers while enabling the reported preprint statistics as direct, repeatable outputs of the same artifact.
What carries the argument
The monotonic SQLite snapshot that serves as the fixed denominator, built from DBLP metadata enriched by open scholarly APIs.
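One way to picture a monotonic snapshot is a table whose rows are only ever added and whose empty fields are only ever filled. The sketch below is illustrative, not the TopVenues implementation: the table name `papers`, its columns, and the helper functions are hypothetical, and it assumes "monotonic" means that enrichment never deletes a record or overwrites a field that is already set.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative, not TopVenues' own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers ("
    " dblp_key TEXT PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " abstract TEXT,"
    " bibtex TEXT)"
)


def add_paper(key, title):
    # Insert a metadata record once; repeating the import never duplicates a row.
    conn.execute(
        "INSERT OR IGNORE INTO papers (dblp_key, title) VALUES (?, ?)", (key, title)
    )


def enrich(key, abstract=None, bibtex=None):
    # Assumed monotonic rule: fill NULL fields only, never overwrite or clear them.
    conn.execute(
        "UPDATE papers SET abstract = COALESCE(abstract, ?),"
        " bibtex = COALESCE(bibtex, ?) WHERE dblp_key = ?",
        (abstract, bibtex, key),
    )


def coverage(column):
    # Share of rows whose value in the given column is not NULL.
    if column not in ("abstract", "bibtex"):
        raise ValueError("unknown column")
    filled, total = conn.execute(
        f"SELECT COUNT({column}), COUNT(*) FROM papers"
    ).fetchone()
    return filled / total if total else 0.0


add_paper("conf/example/A", "Paper A")
add_paper("conf/example/B", "Paper B")
enrich("conf/example/A", abstract="First abstract", bibtex="@inproceedings{a}")
enrich("conf/example/A", abstract="Later, different text")  # no effect: already set
print(coverage("abstract"), coverage("bibtex"))
```

Under this rule a coverage figure can only stay level or rise as enrichment continues, though adding new papers without abstracts would still lower it.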
If this is right
- Any review protocol can cite and reuse the exact same corpus snapshot, eliminating reconstruction drift across studies.
- Preprint appearance rates, lead times, and author-based triage filters become measurable quantities that can be recomputed on later snapshots.
- Keyword search, data-integrity validation, and export to review tools all operate against the same frozen data set.
- The corpus itself becomes a citable research artifact rather than an ad-hoc reconstruction.
- Precision-recall tradeoffs for preprint triage can be reported against the fixed denominator for direct comparison.
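The precision-recall point in the last item can be made concrete with a small sketch. It assumes that "precision gain" means precision at the chosen operating point divided by the base rate of positives, which the text here does not spell out. The function name, the scores, and the labels are hypothetical.

```python
def precision_gain_at_recall(scores, labels, target_recall=0.9):
    """Return (threshold, precision, gain) at the strictest threshold reaching the target.

    scores: filter scores, higher meaning "more likely to appear in the venue set".
    labels: 1 if the preprint later appeared in the venue set, else 0.
    gain:   precision divided by the base rate of positives (assumed definition).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("no positive examples")
    base_rate = total_pos / len(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Evaluate only at the end of a group of tied scores.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if true_pos / total_pos >= target_recall:
            precision = true_pos / i
            return score, precision, precision / base_rate
    raise ValueError("target_recall must be at most 1.0")


# Toy data for illustration only; not taken from the paper.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
threshold, precision, gain = precision_gain_at_recall(scores, labels)
print(threshold, round(precision, 3), round(gain, 3))
```

Because the denominator is fixed, the same function run on a later snapshot gives a directly comparable number.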
Where Pith is reading between the lines
- The same snapshot approach could be applied to other computer science subfields that maintain stable venue lists.
- Longitudinal snapshots would allow tracking of how preprint-to-publication patterns evolve over multiple years.
- Linking the corpus to screening software could reduce manual steps in the early stages of a review.
- Extending coverage to additional venues would require only updates to the scope declaration rather than rebuilding the entire pipeline.
Load-bearing premise
DBLP combined with the chosen open APIs and extractors captures the intended cybersecurity literature without omissions or errors that would change the coverage or preprint numbers.
What would settle it
An independent audit finding that more than a small percentage of papers from the declared venues and years are missing from, or misclassified in, the DBLP-derived corpus would falsify the claim of reliable coverage.
Original abstract
Cybersecurity literature reviews require a reproducible denominator: the set of papers that a protocol includes before screening and synthesis begin. Today, that denominator is often reconstructed from publisher portals, bibliographic indices, and scholarly application programming interfaces (APIs) whose coverage, formats, and query semantics change over time. This paper presents TopVenues, an open-source system that materializes corpus construction as a versioned research artifact. TopVenues declares a venue and year scope, uses DBLP Computer Science Bibliography (DBLP) as the metadata spine, enriches records with abstracts and BibTeX entries via open scholarly APIs and publisher-specific extractors, and stores the results in a monotonic SQLite snapshot, accessible via a command-line interface (CLI), a web interface, and export paths for review workflows. The May 2026 snapshot contains 9,925 papers from 11 cybersecurity sources over 2017 to 2026, with 99.86% abstract coverage and 99.99% BibTeX coverage; keyword search over the full corpus completes in under 31 ms, and a 250-test suite validates the data-integrity invariants. The fixed denominator also enables repeatable measurement: 29.2% of 2024 to 2025 papers from the four top-ranked security conferences in our scope appear as arXiv preprints, with a median of five months before publication, and a prior-author-track-record filter yields a 16.5x precision gain at 90% recall for triaging preprints that later appear in the same venue set. TopVenues links corpus construction to auditable cybersecurity measurement by making the corpus itself executable, inspectable, and citable. The artifact is available at https://github.com/sidneibarbieri/topVenues.
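The preprint rate and median lead time reported in the abstract are simple aggregates once each paper carries a publication date and, where one exists, an arXiv first-appearance date. The sketch below shows one plausible way to compute them. It assumes lead time is counted in whole calendar months and that only preprints dated on or before publication count; neither convention is confirmed here. The function name and the sample dates are hypothetical.

```python
from datetime import date
from statistics import median


def preprint_stats(papers):
    """papers: list of (published, arxiv_first_seen) dates; the second may be None.

    Returns (preprint_rate, median_lead_months). Lead time is the difference in
    calendar months, an assumed convention rather than the paper's stated one.
    """
    leads = [
        (pub.year - pre.year) * 12 + (pub.month - pre.month)
        for pub, pre in papers
        if pre is not None and pre <= pub
    ]
    rate = len(leads) / len(papers) if papers else 0.0
    return rate, (median(leads) if leads else None)


# Illustrative records only; not data from the corpus.
sample = [
    (date(2024, 8, 14), date(2024, 3, 2)),
    (date(2025, 5, 12), date(2024, 11, 20)),
    (date(2024, 10, 14), None),
    (date(2025, 2, 24), date(2024, 10, 1)),
]
rate, lead = preprint_stats(sample)
print(rate, lead)
```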
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents TopVenues, an open-source system that materializes corpus construction for cybersecurity literature reviews as a versioned research artifact. It declares venue/year scopes, uses DBLP as the metadata spine, enriches records with abstracts and BibTeX via open APIs and publisher extractors, and stores results in a monotonic SQLite snapshot accessible via CLI, web interface, and exports. The May 2026 snapshot contains 9,925 papers from 11 sources (2017-2026) with 99.86% abstract coverage and 99.99% BibTeX coverage; a 250-test suite validates internal invariants, keyword search completes in <31 ms, and the fixed corpus enables measurements including a 29.2% arXiv preprint rate (median 5 months prior) for 2024-2025 papers from four top security conferences plus a 16.5x precision gain at 90% recall from a prior-author-track-record filter.
Significance. If the DBLP-based construction and enrichment process accurately reflects the declared scope without material omissions, the work supplies a citable, executable, and auditable substrate that directly addresses the reproducibility problem in cybersecurity literature reviews. The open artifact, monotonic snapshots, fast query performance, and empirical preprint measurements are concrete strengths that could support more rigorous review protocols.
major comments (1)
- [validation and empirical measurements sections] The description of the 250-test suite (which validates internal invariants such as coverage percentages and BibTeX presence) does not include external cross-validation against publisher proceedings or manual sampling of recent papers. This is load-bearing for the headline empirical claims (29.2% arXiv rate and 16.5x filter gain), because systematic DBLP gaps or title/author matching errors for 2024-2025 conference papers would directly alter those statistics.
minor comments (1)
- [Abstract] The abstract refers to a 'May 2026 snapshot'; clarifying the exact snapshot date, versioning scheme, and how future snapshots remain monotonic would strengthen reproducibility claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address the single major comment below and will revise the manuscript accordingly.
- Referee: [validation and empirical measurements sections] The description of the 250-test suite (which validates internal invariants such as coverage percentages and BibTeX presence) does not include external cross-validation against publisher proceedings or manual sampling of recent papers. This is load-bearing for the headline empirical claims (29.2% arXiv rate and 16.5x filter gain), because systematic DBLP gaps or title/author matching errors for 2024-2025 conference papers would directly alter those statistics.
  Authors: We agree that the validation described is internal to the DBLP-derived corpus and its invariants. External cross-validation against publisher proceedings or manual sampling of recent papers is not reported in the current manuscript. This is a valid concern for the reliability of the 29.2% arXiv preprint rate and 16.5x filter gain, as any systematic DBLP omissions or matching errors in 2024-2025 would affect those figures. In the revised manuscript we will add a new subsection under validation that reports the results of manual sampling: we will randomly select and manually verify 100 papers from the four top conferences in 2024-2025 against the corresponding ACM/IEEE/Springer proceedings pages, reporting match rate, any discrepancies, and their impact (if any) on the empirical measurements. We will also add an explicit limitations paragraph noting that DBLP coverage, while high for these venues, is not guaranteed to be exhaustive for the most recent year.
  Revision: yes
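A 100-paper manual sample yields a match rate with real sampling uncertainty, so an interval is more informative than a point estimate. The sketch below computes a Wilson score interval for a binomial proportion. It is a generic statistical aid and not something the rebuttal commits to, and the input counts are placeholders rather than audit results.

```python
import math


def wilson_interval(matches, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= matches <= n:
        raise ValueError("need 0 <= matches <= n and n > 0")
    p = matches / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder counts for illustration; not results from the planned audit.
low, high = wilson_interval(matches=100, n=100)
print(round(low, 3), round(high, 3))
```

Even a perfect 100 out of 100 leaves the lower bound visibly below 100%, which is worth stating when the sample is used to support corpus-wide coverage claims.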
Circularity Check
No circularity; direct corpus construction and empirical observations
The paper constructs a versioned corpus via DBLP spine plus API enrichment and reports direct measurements (29.2% arXiv rate, median lag, 16.5x filter gain) computed on that corpus. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness claims appear; the reported figures are observable outputs of the described process rather than reductions to inputs by construction. The 250-test suite validates internal invariants only, with no derivation chain present.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: DBLP Computer Science Bibliography supplies sufficiently complete and stable metadata for the declared venue-year scope
Forward citations
Cited by 2 Pith papers
- ARENA: An Architecture for Measuring the Transferability of Autonomous Cyber Defense
  ARENA creates anonymized SOC telemetry artifacts that reveal a measurable privacy-utility boundary when used both as training material for MITRE-mapped challenges and as a substrate to detect non-compliant LLM defende...
- From Production SIEM to Reusable Cybersecurity Artifacts
  Methodology turns private production SIEM logs into reusable, anonymized cybersecurity artifacts validated on 37 ATT&CK-mapped challenges and 200 SOCpilot incidents.