TopVenues: A Reproducible Corpus and Tooling Substrate for Cybersecurity Literature Reviews

Ágney Lopes Roth Ferraz; Lourenço Alves Pereira Júnior; Sidnei Barbieri

arxiv: 2606.18320 · v1 · pith:VXBYALKMnew · submitted 2026-06-16 · 💻 cs.CR

Pith reviewed 2026-06-27 00:16 UTC · model grok-4.3

classification 💻 cs.CR

keywords: cybersecurity, literature review, reproducible corpus, arXiv preprints, DBLP, versioned data, bibliographic metadata, conference papers

The pith

TopVenues turns shifting publisher data into a fixed, versioned SQLite corpus for cybersecurity literature reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cybersecurity literature reviews need a stable denominator of papers before screening begins, yet most current denominators are rebuilt from portals and APIs whose coverage changes. TopVenues addresses this by declaring a venue-year scope, anchoring metadata in DBLP, enriching abstracts and BibTeX entries through open APIs and extractors, and storing the result in a monotonic SQLite snapshot. The May 2026 snapshot holds 9,925 papers across 11 sources from 2017 to 2026 at 99.86 percent abstract coverage. This fixed corpus supports repeatable measurements, including the finding that 29.2 percent of 2024-2025 papers from four top conferences first appear on arXiv with a five-month median lead time. An author track-record filter then raises precision 16.5-fold at 90 percent recall when identifying preprints that later reach those venues.
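The precision gain quoted above is a ratio, and a small worked example may make it concrete. The sketch below assumes the gain is the precision of the filtered set at a target recall divided by the unfiltered base rate; this summary does not spell out the paper's exact definition, and the function names, scores, and labels are illustrative rather than taken from TopVenues.

```python
from typing import Sequence


def precision_at_recall(scores: Sequence[float], labels: Sequence[bool],
                        target_recall: float) -> float:
    """Precision of the smallest top-scored prefix that reaches target_recall."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")
    # Rank candidates from highest to lowest score (ties keep input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    for kept, (_, is_positive) in enumerate(ranked, start=1):
        hits += int(is_positive)
        if hits / total_positives >= target_recall:
            return hits / kept
    return hits / len(ranked)  # only reached if target_recall > 1


def precision_gain(scores: Sequence[float], labels: Sequence[bool],
                   target_recall: float) -> float:
    """Assumed definition: filtered precision divided by the base rate."""
    base_rate = sum(labels) / len(labels)  # precision with no filter at all
    return precision_at_recall(scores, labels, target_recall) / base_rate


# Synthetic data: 100 preprints, 10 of which later reach the venue set.
# A higher score stands in for a stronger author track record.
scores = [3.0] * 6 + [2.5] * 3 + [2.0] * 3 + [1.0] + [0.0] * 87
labels = [True] * 6 + [False] * 3 + [True] * 3 + [True] + [False] * 87

filtered_precision = precision_at_recall(scores, labels, 0.9)
gain = precision_gain(scores, labels, 0.9)
```

In this toy data, 90 percent recall is reached after keeping 12 candidates, 9 of them positive, so the filtered precision is 0.75 against a base rate of 0.10. The value of a fixed denominator is that the base rate in the ratio does not drift between runs.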

Core claim

TopVenues declares a venue and year scope, uses DBLP as the metadata spine, enriches records with abstracts and BibTeX via open APIs and publisher extractors, and stores the results in a monotonic SQLite snapshot that functions as an executable, inspectable, and citable corpus. The approach produces 99.86 percent abstract coverage and 99.99 percent BibTeX coverage on 9,925 papers while enabling the reported preprint statistics as direct, repeatable outputs of the same artifact.
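A minimal sketch of what a monotonic SQLite store could look like, using Python's standard `sqlite3` module. It assumes "monotonic" means that records are only added or enriched, never deleted or blanked; the table layout, column names, and the `add_or_enrich` helper are hypothetical and are not the TopVenues schema.

```python
import sqlite3

# Hypothetical schema: one row per paper, keyed by its DBLP key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    dblp_key TEXT PRIMARY KEY,
    venue    TEXT NOT NULL,
    year     INTEGER NOT NULL,
    title    TEXT NOT NULL,
    abstract TEXT,
    bibtex   TEXT
)
"""


def add_or_enrich(conn, dblp_key, venue, year, title, abstract=None, bibtex=None):
    """Insert a paper if unseen, then fill only the fields that are still empty."""
    conn.execute(
        "INSERT OR IGNORE INTO papers (dblp_key, venue, year, title) "
        "VALUES (?, ?, ?, ?)",
        (dblp_key, venue, year, title),
    )
    # COALESCE keeps an existing value, so a later pass cannot erase or replace it.
    conn.execute(
        "UPDATE papers SET abstract = COALESCE(abstract, ?), "
        "bibtex = COALESCE(bibtex, ?) WHERE dblp_key = ?",
        (abstract, bibtex, dblp_key),
    )


def abstract_coverage(conn):
    """Percentage of papers with a non-NULL abstract."""
    with_abstract, total = conn.execute(
        "SELECT COUNT(abstract), COUNT(*) FROM papers"
    ).fetchone()
    return 100.0 * with_abstract / total if total else 0.0


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
add_or_enrich(conn, "conf/x/A24", "venue-x", 2024, "Paper A")
add_or_enrich(conn, "conf/x/B24", "venue-x", 2024, "Paper B")
add_or_enrich(conn, "conf/x/A24", "venue-x", 2024, "Paper A", abstract="First abstract")
add_or_enrich(conn, "conf/x/A24", "venue-x", 2024, "Paper A")  # empty pass, no effect
stored = conn.execute(
    "SELECT abstract FROM papers WHERE dblp_key = ?", ("conf/x/A24",)
).fetchone()[0]
coverage = abstract_coverage(conn)
```

The first-value-wins rule is one possible design choice; a real pipeline might instead record provenance per field and allow corrections through an explicit new snapshot.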

What carries the argument

The monotonic SQLite snapshot that serves as the fixed denominator, built from DBLP metadata enriched by open scholarly APIs.

If this is right

Any review protocol can cite and reuse the exact same corpus snapshot, eliminating reconstruction drift across studies.
Preprint appearance rates, lead times, and author-based triage filters become measurable quantities that can be recomputed on later snapshots.
Keyword search, data-integrity validation, and export to review tools all operate against the same frozen data set.
The corpus itself becomes a citable research artifact rather than an ad-hoc reconstruction.
Precision-recall tradeoffs for preprint triage can be reported against the fixed denominator for direct comparison.
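To illustrate how a preprint rate and a median lead time can be recomputed from a fixed set of records, here is a small sketch. The record layout (publication date plus an optional arXiv date), the month-difference convention, and the sample dates are assumptions for illustration; the paper's own matching and date rules may differ.

```python
from datetime import date
from statistics import median
from typing import List, Optional, Tuple

Record = Tuple[date, Optional[date]]  # (publication date, arXiv date or None)


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later`, ignoring the day."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def preprint_stats(papers: List[Record]):
    """Return (share of papers with an earlier arXiv version, median lead in months)."""
    leads = [
        months_between(arxiv, published)
        for published, arxiv in papers
        if arxiv is not None and arxiv < published
    ]
    rate = len(leads) / len(papers)
    return rate, (median(leads) if leads else None)


# Synthetic records, not data from the TopVenues snapshot.
papers = [
    (date(2024, 8, 1), date(2024, 3, 1)),   # lead of 5 months
    (date(2024, 8, 1), date(2024, 1, 15)),  # lead of 7 months
    (date(2025, 5, 1), None),               # no preprint found
    (date(2025, 5, 1), date(2025, 2, 1)),   # lead of 3 months
]
rate, median_lead = preprint_stats(papers)
```

Because the denominator is `len(papers)` over a frozen list, rerunning the function on the same snapshot gives the same rate, and running it on a later snapshot gives a directly comparable one.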

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same snapshot approach could be applied to other computer science subfields that maintain stable venue lists.
Longitudinal snapshots would allow tracking of how preprint-to-publication patterns evolve over multiple years.
Linking the corpus to screening software could reduce manual steps in the early stages of a review.
Extending coverage to additional venues would require only updates to the scope declaration rather than rebuilding the entire pipeline.
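A scope declaration can be thought of as a small piece of data that expands into venue-year pairs. The sketch below is one hypothetical way to express that; the `Scope` class, its fields, and the placeholder venue names are assumptions, not the format TopVenues uses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Scope:
    """Hypothetical venue-year scope: every venue crossed with every year."""
    venues: Tuple[str, ...]
    first_year: int
    last_year: int

    def pairs(self) -> List[Tuple[str, int]]:
        years = range(self.first_year, self.last_year + 1)
        return [(venue, year) for venue in self.venues for year in years]

    def extended(self, *new_venues: str) -> "Scope":
        """Return a wider scope; the original declaration is left untouched."""
        return Scope(self.venues + new_venues, self.first_year, self.last_year)


base = Scope(("venue-a", "venue-b"), 2017, 2026)
wider = base.extended("venue-c")

# Only the pairs that are new need to be fetched and enriched.
newly_in_scope = sorted(set(wider.pairs()) - set(base.pairs()))
```

Under this reading, widening the scope adds work only for the new pairs, and every pair already covered stays in the corpus.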

Load-bearing premise

DBLP combined with the chosen open APIs and extractors captures the intended cybersecurity literature without omissions or errors that would change the coverage or preprint numbers.

What would settle it

An independent audit that finds more than a small percentage of papers from the declared venues and years missing from or misclassified in the DBLP-derived corpus would falsify reliable coverage.

Figures

Figures reproduced from arXiv: 2606.18320 by Ágney Lopes Roth Ferraz, Lourenço Alves Pereira Júnior, Sidnei Barbieri.

Original abstract

Cybersecurity literature reviews require a reproducible denominator: the set of papers that a protocol includes before screening and synthesis begin. Today, that denominator is often reconstructed from publisher portals, bibliographic indices, and scholarly application programming interfaces (APIs) whose coverage, formats, and query semantics change over time. This paper presents TopVenues, an open-source system that materializes corpus construction as a versioned research artifact. TopVenues declares a venue and year scope, uses DBLP Computer Science Bibliography (DBLP) as the metadata spine, enriches records with abstracts and BibTeX entries via open scholarly APIs and publisher-specific extractors, and stores the results in a monotonic SQLite snapshot, accessible via a command-line interface (CLI), a web interface, and export paths for review workflows. The May 2026 snapshot contains 9,925 papers from 11 cybersecurity sources over 2017 to 2026, with 99.86% abstract coverage and 99.99% BibTeX coverage; keyword search over the full corpus completes in under 31 ms, and a 250-test suite validates the data-integrity invariants. The fixed denominator also enables repeatable measurement: 29.2% of 2024 to 2025 papers from the four top-ranked security conferences in our scope appear as arXiv preprints, with a median of five months before publication, and a prior-author-track-record filter yields a 16.5x precision gain at 90% recall for triaging preprints that later appear in the same venue set. TopVenues links corpus construction to auditable cybersecurity measurement by making the corpus itself executable, inspectable, and citable. The artifact is available at https://github.com/sidneibarbieri/topVenues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TopVenues ships a working open-source pipeline that builds a versioned SQLite corpus from DBLP for cybersecurity papers, plus some concrete preprint measurements, but the headline numbers rest on unverified DBLP coverage.

The paper's main contribution is a practical system that declares a venue scope, pulls from DBLP, enriches via APIs, and dumps everything into a monotonic snapshot with a 250-test suite. The May 2026 snapshot claims 99.86% abstract coverage and sub-31 ms searches. It also reports 29.2% of recent top-conference papers appearing on arXiv and a 16.5x precision lift from an author-track-record filter.

That tooling substrate is new in its specific combination of monotonic storage, validation tests, and public GitHub artifact. The reproducibility angle is handled cleanly for anyone who wants a fixed denominator for literature reviews.

The soft spot is coverage. The 29.2% and 16.5x figures hold only if DBLP contains every paper from the four target venues in 2024-2025 and the arXiv matching is accurate. The paper describes no external cross-check against publisher proceedings or manual sampling, only internal invariants. If DBLP misses recent security papers or title matching fails, those numbers move. The abstract does not claim otherwise.
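The matching risk can be made concrete with a minimal sketch of exact matching on normalized titles. This is not the paper's matching procedure, which is not described here; the normalization rules, function names, and sample titles are assumptions chosen to show one way a match can silently fail.

```python
import re
import unicodedata
from typing import Dict, Iterable


def normalize_title(title: str) -> str:
    """Lowercase, strip accents, and reduce punctuation and spacing to single spaces."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def match_by_title(venue_titles: Iterable[str],
                   arxiv_titles: Iterable[str]) -> Dict[str, str]:
    """Map each arXiv title to the venue title with the same normalized form."""
    index = {normalize_title(title): title for title in venue_titles}
    matches = {}
    for arxiv_title in arxiv_titles:
        key = normalize_title(arxiv_title)
        if key in index:
            matches[arxiv_title] = index[key]
    return matches


# Illustrative titles only.
venue_titles = ["Fuzzing the Kernel: A Study", "Secure Boot, Revisited"]
arxiv_titles = ["fuzzing the kernel - a study", "Revisiting Secure Boot"]
matches = match_by_title(venue_titles, arxiv_titles)
```

The first preprint matches despite different casing and punctuation; the second was retitled before publication and is missed, which would lower a measured preprint rate without any internal invariant noticing.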

This is for people running systematic reviews in cybersecurity who need an auditable starting set rather than ad-hoc searches. A reader who already maintains their own corpus might skip it; someone starting fresh or wanting to cite a reproducible baseline could use the snapshot directly.

It deserves peer review. The artifact is public, the claims are descriptive and falsifiable, and the reproducibility focus is useful even if the coverage question needs tightening.

Referee Report

1 major / 1 minor

Summary. The manuscript presents TopVenues, an open-source system that materializes corpus construction for cybersecurity literature reviews as a versioned research artifact. It declares venue/year scopes, uses DBLP as the metadata spine, enriches records with abstracts and BibTeX via open APIs and publisher extractors, and stores results in a monotonic SQLite snapshot accessible via CLI, web interface, and exports. The May 2026 snapshot contains 9,925 papers from 11 sources (2017-2026) with 99.86% abstract coverage and 99.99% BibTeX coverage; a 250-test suite validates internal invariants, keyword search completes in <31 ms, and the fixed corpus enables measurements including a 29.2% arXiv preprint rate (median 5 months prior) for 2024-2025 papers from four top security conferences plus a 16.5x precision gain at 90% recall from a prior-author-track-record filter.
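For readers who want a feel for what a timed keyword search over a SQLite corpus involves, here is a small sketch using a plain `LIKE` query. The two-column table and the sample rows are hypothetical, and the paper may well use a different mechanism (for example a full-text index), so timings from this sketch say nothing about the reported 31 ms figure.

```python
import sqlite3
import time


def keyword_search(conn, keyword):
    """Substring match on title or abstract (case-insensitive for ASCII)."""
    pattern = f"%{keyword}%"  # note: % and _ inside keyword act as wildcards
    rows = conn.execute(
        "SELECT title FROM papers "
        "WHERE title LIKE ? OR abstract LIKE ? ORDER BY title",
        (pattern, pattern),
    )
    return [title for (title,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT NOT NULL, abstract TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [
        ("Detecting Ransomware at Scale", "We study file-system traces."),
        ("Kernel Fuzzing Revisited", "Includes a ransomware case study."),
        ("A Survey of Side Channels", None),  # abstract missing
    ],
)

start = time.perf_counter()
hits = keyword_search(conn, "ransomware")
elapsed_ms = (time.perf_counter() - start) * 1000.0
```

A row with a missing abstract can still match on its title, so incomplete enrichment narrows what a search can find rather than breaking it.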

Significance. If the DBLP-based construction and enrichment process accurately reflects the declared scope without material omissions, the work supplies a citable, executable, and auditable substrate that directly addresses the reproducibility problem in cybersecurity literature reviews. The open artifact, monotonic snapshots, fast query performance, and concrete empirical measurements on preprints are concrete strengths that could support more rigorous review protocols.

major comments (1)

[validation and empirical measurements sections] The description of the 250-test suite (which validates internal invariants such as coverage percentages and BibTeX presence) does not include external cross-validation against publisher proceedings or manual sampling of recent papers. This is load-bearing for the headline empirical claims (29.2% arXiv rate and 16.5x filter gain), because systematic DBLP gaps or title/author matching errors for 2024-2025 conference papers would directly alter those statistics.
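The distinction between internal invariants and external validation can be seen in a short sketch. The checks below are examples of the kind of internal invariant a test suite might enforce; they are assumptions, not the 250 tests in the artifact, and the table layout is hypothetical.

```python
import sqlite3


def check_invariants(conn, first_year, last_year, min_abstract_pct):
    """Return a list of violated invariants (an empty list means all pass)."""
    violations = []

    def scalar(sql, params=()):
        return conn.execute(sql, params).fetchone()[0]

    if scalar("SELECT COUNT(*) FROM papers WHERE year NOT BETWEEN ? AND ?",
              (first_year, last_year)):
        violations.append("year outside declared scope")
    if scalar("SELECT COUNT(*) FROM papers WHERE TRIM(title) = ''"):
        violations.append("blank title")
    if scalar("SELECT COUNT(*) FROM (SELECT doi FROM papers "
              "WHERE doi IS NOT NULL GROUP BY doi HAVING COUNT(*) > 1)"):
        violations.append("duplicate DOI")
    total = scalar("SELECT COUNT(*) FROM papers")
    covered = scalar("SELECT COUNT(abstract) FROM papers")
    if total and 100.0 * covered / total < min_abstract_pct:
        violations.append("abstract coverage below threshold")
    return violations


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers (dblp_key TEXT PRIMARY KEY, year INTEGER, "
    "title TEXT, doi TEXT, abstract TEXT)"
)
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?, ?, ?)",
    [("k1", 2024, "T1", "10.1/a", "abs"), ("k2", 2025, "T2", "10.1/b", "abs")],
)
clean = check_invariants(conn, 2017, 2026, 99.0)

# One bad row: out-of-scope year, reused DOI, and no abstract.
conn.execute("INSERT INTO papers VALUES ('k3', 2030, 'T3', '10.1/a', NULL)")
broken = check_invariants(conn, 2017, 2026, 99.0)
```

Every check here inspects rows that are present. None can notice a paper that DBLP never listed, which is why the referee asks for a comparison against an outside source.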

minor comments (1)

[Abstract] The abstract refers to a 'May 2026 snapshot'; clarifying the exact snapshot date, versioning scheme, and how future snapshots remain monotonic would strengthen reproducibility claims.
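One way to make the monotonicity and versioning questions checkable is to compare two snapshots directly and to publish a content digest with each release. The sketch below assumes monotonic means that no paper key from an older snapshot disappears in a newer one; the helper names, the two-column table, and the digest recipe are illustrative and not part of TopVenues.

```python
import hashlib
import sqlite3


def make_snapshot(rows):
    """Build an in-memory stand-in for one released snapshot."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE papers (dblp_key TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO papers VALUES (?, ?)", rows)
    return conn


def paper_keys(conn):
    return {key for (key,) in conn.execute("SELECT dblp_key FROM papers")}


def removed_keys(old, new):
    """Keys present in the old snapshot but absent from the new one."""
    return paper_keys(old) - paper_keys(new)


def snapshot_digest(conn):
    """SHA-256 over rows in key order, so insertion order does not matter."""
    digest = hashlib.sha256()
    query = "SELECT dblp_key, title FROM papers ORDER BY dblp_key"
    for key, title in conn.execute(query):
        digest.update(f"{key}\t{title}\n".encode("utf-8"))
    return digest.hexdigest()


may = make_snapshot([("k1", "Paper one"), ("k2", "Paper two")])
june = make_snapshot([("k1", "Paper one"), ("k2", "Paper two"), ("k3", "Paper three")])
bad = make_snapshot([("k1", "Paper one")])

is_monotonic = not removed_keys(may, june)   # nothing dropped between releases
lost_in_bad = removed_keys(may, bad)         # reports the key that went missing
```

A review protocol could then cite a snapshot by date plus digest, and a reader could confirm that they hold the same rows.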

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the single major comment below and will revise the manuscript accordingly.

Referee: [validation and empirical measurements sections] The description of the 250-test suite (which validates internal invariants such as coverage percentages and BibTeX presence) does not include external cross-validation against publisher proceedings or manual sampling of recent papers. This is load-bearing for the headline empirical claims (29.2% arXiv rate and 16.5x filter gain), because systematic DBLP gaps or title/author matching errors for 2024-2025 conference papers would directly alter those statistics.

Authors: We agree that the validation described is internal to the DBLP-derived corpus and its invariants. External cross-validation against publisher proceedings or manual sampling of recent papers is not reported in the current manuscript. This is a valid concern for the reliability of the 29.2% arXiv preprint rate and 16.5x filter gain, as any systematic DBLP omissions or matching errors in 2024-2025 would affect those figures. In the revised manuscript we will add a new subsection under validation that reports the results of manual sampling: we will randomly select and manually verify 100 papers from the four top conferences in 2024-2025 against the corresponding ACM/IEEE/Springer proceedings pages, reporting match rate, any discrepancies, and their impact (if any) on the empirical measurements. We will also add an explicit limitations paragraph noting that DBLP coverage, while high for these venues, is not guaranteed to be exhaustive for the most recent year.

revision: yes
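The manual sampling proposed in the rebuttal can be made repeatable with a seeded draw, and the resulting match rate can be reported with an interval rather than a single number. The sketch below is an illustration under stated assumptions: the key format, the seed, and the choice of a Wilson score interval are not from the paper.

```python
import math
import random


def draw_audit_sample(keys, size, seed):
    """Seeded sample without replacement; sorting first makes it order-independent."""
    rng = random.Random(seed)
    return rng.sample(sorted(keys), size)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical keys standing in for 2024-2025 papers from the four venues.
all_keys = [f"conf/venue{v}/paper{i}" for v in range(4) for i in range(300)]
sample = draw_audit_sample(all_keys, 100, seed=2026)

# Suppose manual checking confirmed 98 of the 100 sampled papers.
low, high = wilson_interval(98, 100)
```

Publishing the seed alongside the snapshot would let a third party redraw the same 100 papers and repeat the audit.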

Circularity Check

0 steps flagged

No circularity; direct corpus construction and empirical observations

The paper constructs a versioned corpus via DBLP spine plus API enrichment and reports direct measurements (29.2% arXiv rate, median lag, 16.5x filter gain) computed on that corpus. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness claims appear; the reported figures are observable outputs of the described process rather than reductions to inputs by construction. The 250-test suite validates internal invariants only, with no derivation chain present.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The system depends on DBLP as the metadata spine and on external open APIs for enrichment; these are treated as reliable inputs rather than derived within the paper.

axioms (1)

domain assumption: DBLP Computer Science Bibliography supplies sufficiently complete and stable metadata for the declared venue-year scope.
The paper selects DBLP as the metadata spine without additional validation of its coverage for the cybersecurity venues.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ARENA: An Architecture for Measuring the Transferability of Autonomous Cyber Defense
cs.CR 2026-06 unverdicted novelty 5.0

ARENA creates anonymized SOC telemetry artifacts that reveal a measurable privacy-utility boundary when used both as training material for MITRE-mapped challenges and as a substrate to detect non-compliant LLM defende...

From Production SIEM to Reusable Cybersecurity Artifacts
cs.CR 2026-06 unverdicted novelty 4.0

Methodology turns private production SIEM logs into reusable, anonymized cybersecurity artifacts validated on 37 ATT&CK-mapped challenges and 200 SOCpilot incidents.

Reference graph

Works this paper leans on

23 extracted references · 1 linked inside Pith · cited by 2 Pith papers

