Automatic Labeling for Entity Extraction in Cyber Security

Robert A. Bridges, Corinne L. Jones, Michael D. Iannacone, Kelly M. Testa, John R. Goodall

classification: cs.IR, cs.CL

keywords: data, extraction, corpus, cyber-security, entities, information, related, security

Timely analysis of cyber-security information necessitates automated information extraction from unstructured text. While state-of-the-art extraction methods produce extremely accurate results, they require ample training data, which is generally unavailable for specialized applications, such as detecting security-related entities; moreover, manual annotation of corpora is very costly and often not a viable solution. In response, we develop a very precise method to automatically label text from several data sources by leveraging related, domain-specific, structured data and provide public access to a corpus annotated with cyber-security entities. Next, we implement a Maximum Entropy Model trained with the average perceptron on a portion of our corpus ($\sim$750,000 words) and achieve near-perfect precision, recall, and accuracy, with training times under 17 seconds.
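
The idea of labeling text automatically from related structured data can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields (`vendor`, `product`, `version`), the example sentence, the `auto_label` helper, and the B-/I-/O tagging scheme are all hypothetical choices made here for illustration. The sketch assumes each piece of text comes paired with a structured record, and tags any token span that exactly matches a field value with that field's name.

```python
from typing import Dict, List, Tuple


def auto_label(tokens: List[str], record: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag token spans that match a structured field value; all other tokens get 'O'.

    Hypothetical helper: matching is case-insensitive and exact, and the
    B-/I- prefixes mark the first and subsequent tokens of a matched span.
    """
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for field, value in record.items():
        phrase = value.lower().split()
        if not phrase:
            continue
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            span_is_free = all(label == "O" for label in labels[i:i + n])
            # Only label spans that are still untagged, so earlier fields win.
            if lowered[i:i + n] == phrase and span_is_free:
                labels[i] = "B-" + field
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + field
    return list(zip(tokens, labels))


# Made-up structured record and description, for illustration only.
record = {"vendor": "Acme", "product": "Web Server", "version": "2.4"}
text = "Buffer overflow in Acme Web Server 2.4 allows remote attackers to execute code"
tagged = auto_label(text.split(), record)

for token, label in tagged:
    print(f"{token}\t{label}")
```

Labels produced this way could then serve as training data for a sequence tagger, which is the role the abstract assigns to the Maximum Entropy Model. Exact string matching is the simplest possible heuristic; a real pipeline would likely need further rules to reach the precision the abstract reports.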

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What Are Adversaries Doing? Automating Tactics, Techniques, and Procedures Extraction: A Systematic Review
cs.SE 2026-04 accept novelty 5.0

Systematic review of 80 papers shows TTP extraction shifting to transformer and LLM methods but limited by narrow datasets, single-label focus, and low reproducibility.