
arxiv: 2605.15345 · v1 · pith:N6EZ6AJS · submitted 2026-05-14 · 💻 cs.CR

Topical Shifts in the Dark Web: A Longitudinal Analysis of Content from the Cybercrime Ecosystem

Pith reviewed 2026-05-19 14:36 UTC · model grok-4.3

classification 💻 cs.CR
keywords: dark web, cybercrime, topic modeling, longitudinal analysis, forum evolution, threat intelligence, clustering, web snapshots

The pith

Dark web cybercrime discussions concentrate about 75% of their volume in a small set of persistent core topics, and the median topic lifespan is 75 months.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks how content on dark web forums and marketplaces shifts across six years by processing more than 11 million webpage snapshots from over 25,000 sites. It builds a topic-modeling system that groups discussions into 55 clusters and measures how long each cluster remains active and how much of the total discussion volume it attracts. The central finding is that most activity stays inside a handful of enduring themes while short-lived topics contribute only about 3 percent of the overall volume. A sympathetic reader would care because this pattern implies that the cybercrime ecosystem changes more slowly and predictably than studies based on static snapshots can reveal. If the result holds, monitoring and intelligence work can shift from chasing every new fad to tracking the stable cores that dominate the conversation.

Core claim

Analysis of 25,065 dark web websites through 11,403,638 HTML snapshots collected over six years identifies 55 thematic clusters in which approximately 75 percent of total discussion volume resides in a small number of persistent core topics, short-lived themes account for only about 3 percent of activity, and the median topic lifespan reaches 75 months, demonstrating gradual thematic evolution rather than sudden replacement.
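
The three headline figures are arithmetic summaries over cluster assignments and time spans. The minimal Python sketch below illustrates that arithmetic under stated assumptions. Everything in it is hypothetical: the topic names, the volumes (chosen so the toy output echoes the reported 75%, 3%, and 75 months), the 12-month cutoff for "short-lived", and the top-2-by-volume definition of "core". The paper's own definitions are not given in this summary.

```python
from statistics import median

# Hypothetical toy data, NOT from the paper:
# topic -> (volume, first_active_month, last_active_month), months indexed 0..74.
topics = {
    "marketplace": (500, 0, 74),
    "hosting": (250, 0, 74),
    "databases": (220, 0, 74),
    "flash_sale": (30, 40, 45),
}

SHORT_LIVED_MAX_MONTHS = 12  # assumed cutoff for "short-lived"
N_CORE = 2                   # assumed: "core" = top-N topics by volume


def lifespan_months(first, last):
    """Inclusive span between the first and last active month."""
    return last - first + 1


def volume_share(names):
    """Fraction of total volume held by the named topics."""
    total = sum(volume for volume, _, _ in topics.values())
    return sum(topics[name][0] for name in names) / total


spans = {name: lifespan_months(first, last) for name, (_, first, last) in topics.items()}
core = sorted(topics, key=lambda name: topics[name][0], reverse=True)[:N_CORE]
short_lived = [name for name, span in spans.items() if span <= SHORT_LIVED_MAX_MONTHS]

core_share = volume_share(core)
short_lived_share = volume_share(short_lived)
median_lifespan = median(spans.values())

print(f"core share: {core_share:.0%}")
print(f"short-lived share: {short_lived_share:.0%}")
print(f"median lifespan: {median_lifespan} months")
```

Because all three numbers depend on where the "core" and "short-lived" boundaries are drawn, changing either assumed constant changes the reported shares.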

What carries the argument

A longitudinal topic-modeling framework that combines domain-specific embeddings, density-based clustering, and temporal aggregation to measure topic prevalence and lifecycle at the website level.
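
The temporal-aggregation step can be sketched as follows, assuming the embedding and clustering stages have already assigned a topic label to each snapshot. The record layout, the site names, and the rule that a website counts once per topic per quarter are assumptions made for illustration. The summary says only that prevalence is measured "at the website level".

```python
from collections import defaultdict

# Hypothetical snapshot records: (website, quarter, topic_label).
snapshots = [
    ("siteA", "2020Q1", "hosting"),
    ("siteA", "2020Q1", "hosting"),    # repeated crawl of the same site
    ("siteB", "2020Q1", "databases"),
    ("siteA", "2020Q2", "hosting"),
    ("siteC", "2020Q2", "hosting"),
    ("siteB", "2020Q2", "databases"),
]


def prevalence_by_period(records):
    """Return {(period, topic): share} where each website counts once
    per topic per period, so heavily crawled sites are not over-weighted."""
    sites = defaultdict(set)  # (period, topic) -> distinct websites
    for site, period, topic in records:
        sites[(period, topic)].add(site)

    totals = defaultdict(int)  # period -> number of (website, topic) pairs
    for (period, _topic), members in sites.items():
        totals[period] += len(members)

    return {key: len(members) / totals[key[0]] for key, members in sites.items()}


prevalence = prevalence_by_period(snapshots)
for (period, topic), share in sorted(prevalence.items()):
    print(period, topic, round(share, 3))
```

Counting distinct websites rather than raw snapshots is one way to limit distortion from uneven crawl frequency. Whether the paper does this is not stated here.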

If this is right

  • Law-enforcement and threat-intelligence efforts can achieve higher returns by concentrating on the small set of persistent core topics instead of monitoring every emerging theme.
  • Static single-point snapshots of dark web content miss the long-term stability that characterizes most discussion volume.
  • Cybercrime forums and marketplaces adapt to external pressures such as enforcement actions through slow, incremental shifts rather than wholesale replacement of topics.
  • The 3 percent share held by short-lived themes indicates that transient events or hype cycles contribute little to the overall activity on these platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gradual-evolution pattern may appear in other pressured online environments such as clear-web fraud forums or encrypted messaging groups.
  • Resource allocation for continuous monitoring could be reduced by maintaining lightweight trackers only on the identified core topics.
  • If topic lifespans remain stable across future years, the 75-month median supplies a natural time window for longitudinal studies of how specific enforcement events influence discussion volume.

Load-bearing premise

The topic-modeling approach correctly groups real discussion themes and tracks their true lifespans without major distortion from the way the snapshots were collected or from choices made during clustering.

What would settle it

A new collection of dark web snapshots processed with the same framework that shows either markedly lower concentration in core topics or a median lifespan well below 75 months would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.15345 by Irdin Pekaric, Luca Allodi, Maximilian Schafer, Philipp Zech, Raffaela Groner, Roy Ricaldi.

Figure 1: Overview of methodology from data preprocessing, to analysis pipeline, and evaluation.
Figure 4: Topic Lifespan Distribution. Adjacent paper text (truncated in extraction): "[…] identified themes are either continuous or recurring, meaning that no final topic cluster appears only once across the observation window. Thus, the dominant topics are not only highly prevalent, but also structurally persistent over time. This indicates that the dark web ecosystem is organized around stable thematic functions. ANSWER TO RQ1: Topic prevalence is highly concen…"
Figure 3: Temporal Prevalence of Top 20 Topics.
Figure 5: Topic Lifecycle. Extracted plot text (possibly from an adjacent figure): x-axis, Time (quarters), 2020Q1 to 2026Q1; y-axis, % share of activity, 0 to 50; legend includes Torrents and Files, Forum Reputation, Online Shopping, Forum Features, Infrastructure and Hosting, Transaction Protection, Forum Security, Counterfeit Money, Databases, Onli…
Figure 6: Topic Temporal Prevalence (Top ten Topics).
Figure 9: Recurring vs One-Off Topics. Adjacent paper text (truncated in extraction): "[…] fade over multiple periods and not disappear immediately. ANSWER TO RQ2: Topic lifecycles are dominated by persistence and gradual change. The median topic lifespan is 75 months, and even the shortest-lived grouped topics remain active for multiple years. Changes in thematic prominence occur mainly through gradual growth and decline rather than abrupt emergence or disappeara…"
Figure 10: Distribution of topic labels in corpus. Adjacent paper text: "Annotators corrected these to better align with the actual topic content. Topic Merging. Multiple clusters were found to represent highly similar or identical concepts. Annotators identified such overlaps and merged them into a single, unified topic to avoid redundancy and improve analytical clarity. A.5.3. Examples of Label Corrections and Topic Merges"

original abstract

The dark web hosts a dynamic ecosystem of cybercrime forums and marketplaces that adapt to law enforcement pressure, technological change, and economic incentives. Prior research has extracted cyber threat intelligence from these platforms using static snapshots, with limited attention to how discussions evolve over time. In this study, we conduct a longitudinal analysis of 25,065 websites in the dark web using 11,403,638 HTML snapshots (approximately 1245.38 GB) collected over six years. We develop a longitudinal topic-modeling framework combining domain-specific embeddings, density-based clustering and temporal aggregation to measure topic prevalence and lifecycle at the website level. Our analysis identifies 55 thematic clusters. We find that approximately 75% of total discussion volume is concentrated in a small set of persistent core topics, while short-lived themes account for approximately 3% of activity. The median topic lifespan is 75 months, indicating gradual thematic evolution rather than abrupt replacement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes 25,065 dark web websites via 11,403,638 HTML snapshots collected over six years. It introduces a longitudinal topic-modeling pipeline that combines domain-specific embeddings, density-based clustering, and temporal aggregation to extract 55 thematic clusters. The central empirical claims are that ~75% of total discussion volume concentrates in a small set of persistent core topics, short-lived themes account for ~3% of activity, and the median topic lifespan is 75 months, supporting a conclusion of gradual thematic evolution rather than abrupt replacement.

Significance. If the pipeline accurately measures prevalence and lifespan without material distortion, the work supplies large-scale longitudinal evidence on the stability of cybercrime discussions, filling a gap left by prior static-snapshot studies. The dataset scale and website-level tracking constitute clear strengths; the findings would inform threat-intelligence monitoring and law-enforcement resource allocation if shown to be robust.

major comments (2)
  1. [Methods] Methods section (framework description): the density-based clustering and temporal aggregation steps are presented without any reported sensitivity tests on hyperparameters (minimum cluster size, distance threshold, aggregation window). These free parameters directly determine the partition into 55 clusters and the subsequent calculations of the 75% core-topic volume share, 3% short-lived share, and 75-month median lifespan; absence of such checks leaves open the possibility that the gradual-evolution conclusion is an artifact of the chosen settings or non-uniform snapshot collection.
  2. [Results] Results (volume and lifespan claims): the reported 75% and 3% volume figures and 75-month median are given without error bars, bootstrap intervals, or comparisons against alternative clustering algorithms or ground-truth subsets. Because the central claims rest on the correctness of these quantities, the lack of quantitative validation or robustness metrics weakens confidence that the measurements reflect properties of the data rather than pipeline choices.
minor comments (2)
  1. [Data collection] Clarify whether the 1,245.38 GB figure refers to compressed or uncompressed HTML and whether duplicate or low-quality snapshots were filtered before embedding.
  2. [Abstract] The abstract states the dataset size and high-level method but omits any mention of validation metrics; adding a short sentence on inter-annotator agreement or held-out topic coherence would improve readability.
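
The sensitivity check requested in major comment 1 could take roughly the following shape. This is a sketch under assumptions, not the authors' pipeline. The paper is described only as using "density-based clustering", so a minimal pure-Python DBSCAN stands in for whatever algorithm was actually used. The 2-D points stand in for document embeddings, and the parameter grid is hypothetical.

```python
from itertools import product
from math import dist


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise.
    A point is a core point if at least min_samples points (itself
    included) lie within distance eps."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)
    return labels


def sweep(points, eps_values, min_samples_values):
    """Number of clusters found for every (eps, min_samples) combination."""
    results = {}
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = dbscan(points, eps, min_samples)
        results[(eps, min_samples)] = len({label for label in labels if label != -1})
    return results


# Hypothetical toy "embeddings": two tight groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 10)]
results = sweep(points, eps_values=(1.0, 8.0), min_samples_values=(2, 3, 4))
for (eps, min_samples), n_clusters in sorted(results.items()):
    print(f"eps={eps} min_samples={min_samples} -> {n_clusters} clusters")
```

On this toy data the cluster count moves between zero, one, and two across the grid. That is the kind of dependence the referee asks the authors to rule out for the 55-cluster partition.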

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for the careful reading and valuable suggestions that will help improve the robustness of our analysis. We respond to each major comment in turn, indicating where we will revise the manuscript to address the concerns raised.

point-by-point responses
  1. Referee: [Methods] Methods section (framework description): the density-based clustering and temporal aggregation steps are presented without any reported sensitivity tests on hyperparameters (minimum cluster size, distance threshold, aggregation window). These free parameters directly determine the partition into 55 clusters and the subsequent calculations of the 75% core-topic volume share, 3% short-lived share, and 75-month median lifespan; absence of such checks leaves open the possibility that the gradual-evolution conclusion is an artifact of the chosen settings or non-uniform snapshot collection.

    Authors: We concur that reporting sensitivity tests is important for validating the pipeline choices. The hyperparameters were tuned to produce coherent and stable clusters based on initial explorations of the data, but this process was not documented in the submitted manuscript. For the revision, we will include a new subsection in the Methods detailing sensitivity analyses. We will test ranges for minimum cluster size, distance threshold, and aggregation window, and show that the primary conclusions—75% of volume in persistent core topics, 3% in short-lived themes, and a median lifespan of 75 months—hold across these variations. We will also address potential effects of non-uniform snapshot collection by analyzing subsets with more uniform temporal coverage. revision: yes

  2. Referee: [Results] Results (volume and lifespan claims): the reported 75% and 3% volume figures and 75-month median are given without error bars, bootstrap intervals, or comparisons against alternative clustering algorithms or ground-truth subsets. Because the central claims rest on the correctness of these quantities, the lack of quantitative validation or robustness metrics weakens confidence that the measurements reflect properties of the data rather than pipeline choices.

    Authors: The referee correctly notes the absence of uncertainty estimates and comparative validations for the key quantitative results. These figures are obtained by aggregating the cluster memberships over all 11,403,638 snapshots. In the revised paper, we will add bootstrap-derived confidence intervals for the volume shares by resampling at the website level. We will also include a comparison of topic lifespans derived from our density-based method versus an alternative embedding-based clustering technique applied to a random subset of 5,000 websites. Ground truth is not available for the full dataset; however, we will report additional metrics such as average cluster purity based on manual inspection of a sample of clusters to support the findings. revision: partial
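
The website-level bootstrap promised in the second response can be sketched as a percentile bootstrap over sites. The per-site counts, the choice of core topics, and the resampling settings below are hypothetical. With only four toy sites the resulting interval is very wide and serves only to show the mechanics.

```python
import random

# Hypothetical per-website snapshot counts by topic (not the paper's data).
site_volumes = {
    "siteA": {"marketplace": 40, "flash_sale": 2},
    "siteB": {"marketplace": 25, "hosting": 10},
    "siteC": {"hosting": 15, "databases": 20},
    "siteD": {"databases": 5, "flash_sale": 3},
}
CORE_TOPICS = {"marketplace", "hosting"}  # assumed core set


def core_share(sites):
    """Share of total volume that falls in CORE_TOPICS for a list of sites."""
    total = sum(sum(counts.values()) for counts in sites)
    core = sum(n for counts in sites for topic, n in counts.items() if topic in CORE_TOPICS)
    return core / total


def bootstrap_ci(volumes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole websites with replacement."""
    rng = random.Random(seed)
    sites = list(volumes.values())
    stats = sorted(core_share(rng.choices(sites, k=len(sites))) for _ in range(n_boot))
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


point = core_share(list(site_volumes.values()))
low, high = bootstrap_ci(site_volumes)
print(f"core share {point:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Resampling websites rather than individual snapshots respects the fact that snapshots from the same site are not independent.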

standing simulated objections not resolved
  • Complete ground-truth validation for all 55 clusters across the entire dataset is not possible given the scale and sensitive nature of the dark web data.

Circularity Check

0 steps flagged

Empirical measurement pipeline with no definitional or self-referential reduction

full rationale

The paper collects 11M+ HTML snapshots and applies a pipeline of domain-specific embeddings, density-based clustering, and temporal aggregation to extract 55 clusters, then computes prevalence shares and lifespans directly from those clusters. The 75% core-topic volume and 75-month median lifespan are arithmetic summaries of the resulting cluster assignments and time spans; they are not obtained by fitting a parameter to a subset and relabeling it as a prediction, nor by any self-citation that supplies the uniqueness or functional form of the result. No equation or claim in the abstract or described framework reduces to its own inputs by construction. Hyperparameter dependence is a robustness issue, not a circularity issue under the stated criteria.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

Analysis depends on choices in embedding model, density-based clustering parameters, and temporal aggregation windows that are not detailed; these function as free parameters whose values directly shape the reported 55 clusters and lifespan statistics.

free parameters (2)
  • clustering hyperparameters
    Density-based clustering requires distance thresholds and minimum cluster sizes that determine the 55 thematic clusters and the resulting volume percentages.
  • temporal aggregation window
    Choice of time bins for tracking topic prevalence affects measured lifespan and the distinction between persistent and short-lived themes.
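
To illustrate how the aggregation window alone can change a measured lifespan, the sketch below bins one hypothetical topic's active months at three window sizes. The rules used here are assumptions: a bin is active if any month in it is active, and lifespan is the span from first to last active bin.

```python
def lifespan_in_months(active_months, window):
    """Lifespan of a topic when months are grouped into bins of `window` months.
    The span from the first to the last active bin is converted back to months."""
    bins = sorted({month // window for month in active_months})
    return (bins[-1] - bins[0] + 1) * window


# Hypothetical month indices in which one topic shows any activity.
active = [4, 5, 13]

monthly = lifespan_in_months(active, 1)     # months 4..13
quarterly = lifespan_in_months(active, 3)   # quarters 1..4
yearly = lifespan_in_months(active, 12)     # years 0..1

print(monthly, quarterly, yearly)
```

The same activity pattern yields a longer lifespan as the window grows, so a coarse window can make a sporadic topic look persistent.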

pith-pipeline@v0.9.0 · 5710 in / 1216 out tokens · 37523 ms · 2026-05-19T14:36:23.837877+00:00 · methodology
