arxiv: 2604.12228 · v1 · submitted 2026-04-14 · 💻 cs.CR


From IOCs to Regex: Automating CTI Operationalization for SOC with LLMs

Pei-Yu Tseng (1), Lan Zhang (2), ZihDwo Yeh (1), Xiaoyan Sun (3), Xushu Dai (1), Peng Liu (1) ((1) The Pennsylvania State University, USA, (2) Northern Arizona University, (3) Worcester Polytechnic Institute, USA)


Pith reviewed 2026-05-10 16:03 UTC · model grok-4.3

classification 💻 cs.CR

keywords cyber threat intelligence · indicators of compromise · regular expressions · large language models · security operations · automated regex generation · CTI operationalization


The pith

IOCRegex-gen automatically converts indicators of compromise from cyber threat reports into regular expressions using large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IOCRegex-gen to automate the manual conversion of Indicators of Compromise (IOCs) from CTI reports into regexes for log parsing, forensics, and SIEM rules. Current practice is slow and error-prone as CTI volumes increase, and prior LLM work only extracts plain IOC strings, which cannot absorb log variations or attacker changes. The system adds a group-aware mechanism that decides which IOC parts become capture groups, and an iterative reasoning pipeline with multi-stage checks for syntactic and semantic correctness. Tests on more than 3,000 real CTI reports and 2,400 ground-truth strings from the MITRE ATT&CK Evaluation framework yield a 99.1 percent average hit rate and a 0.8 percent false-positive rate.

Core claim

IOCRegex-gen converts IOCs extracted from CTI reports into regexes using two components: a group-aware mechanism that decides which segments become capture or non-capture groups, and an iterative reasoning and multi-stage validation pipeline that enforces syntactic validity and semantic correctness. On more than 3,000 real CTI reports and 2,400 ground-truth strings from the MITRE ATT&CK Evaluation framework, it reaches a 99.1 percent average hit rate and a 0.8 percent false-positive rate.
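The two validation stages named in the claim can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the paper's pipeline: the helper names (`is_syntactically_valid`, `is_semantically_correct`, `refine`) are hypothetical, the list of candidates stands in for successive LLM outputs, and "semantic correctness" is approximated here as matching every positive sample and no negative sample.

```python
import re
from typing import Iterable, Optional, Sequence


def is_syntactically_valid(pattern: str) -> bool:
    """Stage 1: the pattern must compile."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def is_semantically_correct(
    pattern: str, positives: Sequence[str], negatives: Sequence[str]
) -> bool:
    """Stage 2 (assumed definition): match all positives and no negatives."""
    rx = re.compile(pattern)
    matches_all = all(rx.search(s) for s in positives)
    matches_none = not any(rx.search(s) for s in negatives)
    return matches_all and matches_none


def refine(
    candidates: Iterable[str], positives: Sequence[str], negatives: Sequence[str]
) -> Optional[str]:
    """Return the first candidate that passes both stages, else None."""
    for pattern in candidates:
        if not is_syntactically_valid(pattern):
            continue  # a real loop would feed the error back to the model
        if is_semantically_correct(pattern, positives, negatives):
            return pattern
    return None


# Hypothetical candidates, ordered as if produced over three iterations.
candidates = [
    r"schtasks /create (/tn",                        # fails stage 1
    r"schtasks",                                     # too broad, fails stage 2
    r"schtasks(?:\.exe)?\s+/create\s+/tn\s+(\S+)",   # passes both stages
]
positives = [
    r"schtasks /create /tn Updater /tr C:\x.exe",
    r"schtasks.exe  /create /tn WinSvc /tr C:\y.exe",
]
negatives = ["schtasks /query", "schtasks /delete /tn Updater"]

accepted = refine(candidates, positives, negatives)
print(accepted)
```

Separating the compile check from the sample-based check keeps the failure reason explicit, which is the kind of signal an iterative loop would need in order to correct the next attempt.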

What carries the argument

IOCRegex-gen, an LLM-based pipeline that uses a group-aware mechanism to classify IOC segments into capture or non-capture groups and an iterative reasoning process with validation stages to produce usable regexes.
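A short Python sketch may help make the capture versus non-capture distinction concrete. Everything in it is hypothetical: the IOC string, the hand-written segment labels, and the helper `build_regex` are illustrations, not the paper's mechanism. The choice to capture the variable parts and keep the invariant parts in non-capture groups is also an assumption; the summary above does not say which way the assignment goes.

```python
import re

# Hypothetical hand-labelled segments of one command-line IOC.
# "fixed" = assumed invariant, "variable" = assumed to change across campaigns.
SEGMENTS = [
    ("fixed", "certutil"),
    ("fixed", "-urlcache"),
    ("fixed", "-f"),
    ("variable", r"\S+"),  # download URL
    ("variable", r"\S+"),  # output path
]


def build_regex(segments):
    """Join segments with flexible whitespace; capture only the variable ones."""
    parts = []
    for kind, text in segments:
        if kind == "fixed":
            parts.append("(?:" + re.escape(text) + ")")  # non-capture group
        elif kind == "variable":
            parts.append("(" + text + ")")  # capture group
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return r"\s+".join(parts)


PATTERN = re.compile(build_regex(SEGMENTS), re.IGNORECASE)

reported_ioc = r"certutil -urlcache -f http://198.51.100.7/a.exe C:\Temp\a.exe"
log_variant = r"CertUtil  -urlcache   -f http://example.test/p.bin C:\Users\Public\p.bin"

# Plain-string matching misses the variant; the regex still matches it.
print(reported_ioc in log_variant)
match = PATTERN.search(log_variant)
print(match.groups() if match else None)
```

The sketch shows why a plain IOC string is brittle: a change in casing, spacing, URL, or output path defeats substring matching, while the grouped pattern still matches and exposes the changed values as captures.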

If this is right

SOC teams can process growing volumes of CTI reports into operational regex rules without manual effort.
Regex patterns generated this way can capture variations in log formats and attacker behaviors more reliably than plain IOC strings.
Digital forensics and SIEM rule creation become faster and less error-prone at scale.
The same pipeline could support repeated regeneration of regexes as threats evolve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be extended to generate other detection artifacts such as YARA or Sigma rules directly from CTI text.
Integration into existing security tools might allow near real-time operationalization of incoming threat reports.
Over time, the system could reduce the need for specialized analysts on routine CTI-to-rule conversion tasks.

Load-bearing premise

The group-aware mechanism and iterative reasoning pipeline will keep producing semantically correct regexes that work across different log formats, system contexts, and changing attacker tactics without needing extra human fixes.

What would settle it

A set of new CTI reports containing IOCs in previously unseen log formats where the generated regexes either miss valid matches or trigger many false positives on real SOC log data.

Figures

Figures reproduced from arXiv: 2604.12228 by Pei-Yu Tseng, Lan Zhang, ZihDwo Yeh, Xiaoyan Sun, Xushu Dai, and Peng Liu.

**Figure 1.** Example paragraphs from a CTI report. Adjacent text from the paper: "… a system has been breached. However, not all types of IOCs are suitable for regex generation. According to the Pyramid of Pain [20], the low-value indicators (such as hash values, IP addresses, domain names, and network artifacts) are the ones most easily changed by adversaries. These indicators are highly volatile, as attackers can regenerate file hashes, rotate or prox…"

**Figure 2.** Motivating Example. Adjacent text from the paper: "… be overcome. As discussed in Section 2.1.2, prior research has primarily focused on automatic IOC extraction. In this section, we focus on the remaining challenges specific to the regex generation workflow. First, current LLMs lack sufficient knowledge to distinguish which parts of an IOC belong to the capture group versus the non-capture group (C1). Unlike experienced analysts, who re…"

**Figure 3.** Overview of IOCRegex-gen. Adjacent text from the paper: "… step that isolates invariant IOC fragments from those likely to vary across environments or attacker campaigns. In this phase, IOCRegex-gen retrieves data from an external graph database and uses our proprietary algorithm to differentiate between capture groups and non-capture groups within each string extracted during IOCs Extraction. Additionally, when encountering any string tha…"

**Figure 4.** The tree structure of command line argument and …

**Figure 5.** Workflow of Reasoning-based Regex Generation

**Figure 6.** The scores of the regular expressions for each …

**Figure 7.** Similarity comparison between regular expressions

Original abstract

Cyber Threat Intelligence (CTI) reports contain Indicators of Compromise (IOCs) that are critical for security operations. To operationalize these IOCs across heterogeneous logs, analysts often convert them into regular expressions (regexes) for tasks such as digital forensics, log parsing, and SIEM rule creation. However, regex construction is still largely manual, requiring analysts to extract IOCs from CTI reports and transform them into syntactically valid and semantically precise patterns. This process is slow, error-prone, and increasingly impractical as CTI volumes grow. Although recent studies have applied Large Language Models (LLMs) to IOC extraction, they typically output plain strings rather than regexes, limiting practical deployment. Plain IOCs cannot effectively capture variations in system context, log format, or attacker behavior. To address this gap, we propose IOCRegex-gen, a fully automated LLM-based regex generation system that converts IOCs into regexes. The system introduces two key innovations: (i) a group-aware mechanism that identifies which IOC segments should be represented as capture or non-capture groups, and (ii) an iterative reasoning and multi-stage validation pipeline to ensure syntactic validity and semantic correctness. Experiments on over 3,000 real CTI reports and 2,400 ground-truth strings from the MITRE ATT&CK Evaluation framework show that IOCRegex-gen achieves an average hit rate of 99.1% and a false-positive rate of only 0.8%, demonstrating its effectiveness for large-scale CTI processing and automated regex generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper builds a practical LLM pipeline for turning IOCs into regexes with two engineering tweaks and reports strong numbers on a big dataset, but the evaluation setup leaves real generalization to messy logs unproven.


The main thing here is that IOCRegex-gen uses LLMs to convert indicators from CTI reports into regex patterns instead of plain strings, adding a group-aware step to pick capture groups and an iterative validation loop to fix syntax and semantics. That combination is the actual new piece relative to earlier extraction work, and it targets a clear operational bottleneck for SOC teams who need patterns that handle log variations.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes IOCRegex-gen, an LLM-based system to automatically convert Indicators of Compromise (IOCs) extracted from Cyber Threat Intelligence (CTI) reports into regular expressions suitable for SOC tasks such as log parsing and SIEM rule creation. It introduces two innovations: a group-aware mechanism to decide capture versus non-capture groups for IOC segments, and an iterative reasoning plus multi-stage validation pipeline to enforce syntactic validity and semantic correctness. Evaluation is reported on more than 3,000 real CTI reports together with 2,400 ground-truth strings drawn from the MITRE ATT&CK Evaluation framework, yielding an average hit rate of 99.1% and false-positive rate of 0.8%.

Significance. If the reported metrics are shown to be robust under proper controls, the work would address a concrete operational bottleneck: the manual, error-prone conversion of raw IOC strings into regexes that tolerate log-format variation and attacker TTP evolution. Successful automation at this scale could materially improve the speed and consistency with which CTI is operationalized in security operations centers.

major comments (3)

Abstract: the effectiveness claim rests on a 99.1% hit rate and 0.8% FPR, yet the abstract (and, by extension, the evaluation description) supplies no baselines, no ablation of the group-aware mechanism versus the iterative pipeline, no definition of how FPR is computed against negative examples, and no error analysis or edge-case handling; without these the numbers cannot be assessed as evidence of generalization.
Evaluation (MITRE ATT&CK strings): the 2,400 ground-truth strings are static and drawn from a curated framework; the manuscript does not report out-of-distribution tests on raw heterogeneous logs, format drift, encoding differences, or post-2023 attacker IOC variations, leaving the central generalization claim for SOC deployment unsupported.
Method (group-aware + iterative pipeline): no quantitative evidence is provided that the two innovations are necessary or sufficient; an ablation removing each component in turn would be required to establish that the reported performance is attributable to the proposed mechanisms rather than to the base LLM.

minor comments (2)

Abstract: the phrase 'fully automated' is used while the pipeline description implies multiple LLM calls and validation stages; a brief clarification of what 'fully automated' means in practice would improve precision.
The manuscript would benefit from an explicit statement of the exact prompt templates and temperature settings used for the LLM calls, as these are load-bearing for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, explaining our position and the changes we will make to improve the manuscript's clarity and rigor.


Referee: Abstract: the effectiveness claim rests on a 99.1% hit rate and 0.8% FPR, yet the abstract (and, by extension, the evaluation description) supplies no baselines, no ablation of the group-aware mechanism versus the iterative pipeline, no definition of how FPR is computed against negative examples, and no error analysis or edge-case handling; without these the numbers cannot be assessed as evidence of generalization.

Authors: We agree that the abstract and evaluation description would be strengthened by including baselines, a definition of FPR, and error analysis. In the revised manuscript we will update the abstract to reference comparisons against direct LLM prompting and basic string-to-regex heuristics. We will add an explicit definition of the false-positive rate (generated regexes tested on negative log samples without the IOC, counting unintended matches) and include a new error-analysis subsection covering edge cases such as encoded IOCs, special characters, and ambiguous segments together with how the multi-stage validation mitigates them. revision: yes
Referee: Evaluation (MITRE ATT&CK strings): the 2,400 ground-truth strings are static and drawn from a curated framework; the manuscript does not report out-of-distribution tests on raw heterogeneous logs, format drift, encoding differences, or post-2023 attacker IOC variations, leaving the central generalization claim for SOC deployment unsupported.

Authors: The 3,000+ real CTI reports already introduce substantial heterogeneity in format, encoding, and context beyond the curated MITRE strings. Nevertheless, we acknowledge the value of more explicit out-of-distribution testing. In the revision we will add a dedicated experiment using a held-out set of post-2023 CTI reports and logs that exhibit format drift and encoding variations, reporting hit rate and FPR on this set to further substantiate generalization for SOC use. revision: partial
Referee: Method (group-aware + iterative pipeline): no quantitative evidence is provided that the two innovations are necessary or sufficient; an ablation removing each component in turn would be required to establish that the reported performance is attributable to the proposed mechanisms rather than to the base LLM.

Authors: We recognize that quantitative ablations are required to isolate the contribution of each component. We will perform and report two ablation experiments in the revised manuscript: (1) replacing the group-aware mechanism with default capture groups for all segments, and (2) disabling the iterative reasoning and multi-stage validation pipeline in favor of single-pass generation. The resulting hit rates and false-positive rates will demonstrate the necessity of both innovations relative to the base LLM. revision: yes
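The false-positive definition given in the first response (generated regexes run against negative log samples that lack the IOC, counting unintended matches) can be sketched in Python alongside a hit-rate counterpart. This is one reading of those metrics, not the paper's evaluation code: the function names, the per-string counting, and the sample data are all assumptions made for illustration.

```python
import re
from typing import Sequence


def _any_match(compiled, text: str) -> bool:
    return any(rx.search(text) for rx in compiled)


def hit_rate(patterns: Sequence[str], positives: Sequence[str]) -> float:
    """Assumed definition: share of ground-truth strings matched by any pattern."""
    if not positives:
        raise ValueError("positives must not be empty")
    compiled = [re.compile(p) for p in patterns]
    return sum(_any_match(compiled, s) for s in positives) / len(positives)


def false_positive_rate(patterns: Sequence[str], negatives: Sequence[str]) -> float:
    """Assumed definition: share of IOC-free samples matched by any pattern."""
    if not negatives:
        raise ValueError("negatives must not be empty")
    compiled = [re.compile(p) for p in patterns]
    return sum(_any_match(compiled, s) for s in negatives) / len(negatives)


# Hypothetical data, chosen so that both rates are non-trivial.
patterns = [r"powershell(?:\.exe)?\s+-enc\s+(\S+)"]
positives = [
    "powershell -enc SQBFAFgA",
    "powershell.exe  -enc aQBlAHgA",
    "pwsh -enc SQBFAFgA",            # missed: different binary name
]
negatives = [
    "powershell -File backup.ps1",
    "cmd.exe /c dir",
    "notepad.exe",
    "echo powershell -enc test",     # unintended match
]

print(hit_rate(patterns, positives))
print(false_positive_rate(patterns, negatives))
```

An ablation of the kind the authors promise would amount to calling the same two functions on the pattern sets produced by each variant of the pipeline and comparing the results.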

Circularity Check

0 steps flagged

No circularity: empirical results rest on external MITRE ground truth


The paper describes an LLM-based engineering system (group-aware mechanism + iterative validation pipeline) whose central claims are validated by direct comparison to 2,400 external MITRE ATT&CK Evaluation strings and more than 3,000 CTI reports. No equations, fitted parameters, self-citations, or internal definitions are used to derive the reported hit rate or FPR; the metrics are computed against independent ground-truth data. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the approach relies on standard LLM capabilities plus the two described mechanisms.

pith-pipeline@v0.9.0 · 5635 in / 1071 out tokens · 48803 ms · 2026-05-10T16:03:49.079833+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 16 canonical work pages · 2 internal anchors

[1]

Threat intelligence global market report

“Threat intelligence global market report.” [Online]. Available: https://www.thebusinessresearchcompany.com/report/threat-intelligence-global-market-report#:~:text=The%20threat%20intelligence%20market%20size%20is%20expected%20to%20see%20rapid,intelligence%2C%20focus%20on%20cloud%20security
[2]

TINKER: A framework for Open source Cyberthreat Intelligence,

N. Rastogi, S. Dutta, A. Gittens, M. J. Zaki, and C. Aggarwal, “TINKER: A framework for Open source Cyberthreat Intelligence,” Tech. Rep
[3]

TTPDrill: Automatic and accurate extraction of threat actions from unstructured text of CTI Sources,

G. Husari, E. Al-Shaer, M. Ahmed, B. Chu, and X. Niu, “TTPDrill: Automatic and accurate extraction of threat actions from unstructured text of CTI Sources,” in ACM International Conference Proceeding Series, vol. Part F132521. Association for Computing Machinery, Dec. 2017, pp. 103–115

2017
[4]

Llm-tikg: Threat intelligence knowledge graph construction utilizing large language model,

Y. Hu, F. Zou, J. Han, X. Sun, and Y. Wang, “Llm-tikg: Threat intelligence knowledge graph construction utilizing large language model,” Computers & Security, vol. 145, p. 103999, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404824003043

2024
[5]

Llmcloudhunter: Harnessing llms for automated extraction of detection rules from cloud-based cti,

Y. Schwartz, L. Benshimol, D. Mimran, Y. Elovici, and A. Shabtai, “Llmcloudhunter: Harnessing llms for automated extraction of detection rules from cloud-based cti,” Waikiki ’24: Annual Computer Security Applications Conference, December 09–13, 2024, Waikiki, Hawaii, USA, vol. 1, 2024. [Online]. Available: http://arxiv.org/abs/2407.05194

work page arXiv 2024
[6]

Intelex: A llm-driven attack-level threat intelligence extraction framework,

M. Xu, H. Wang, J. Liu, Y. Lin, C. X. Y. Liu, H. W. Lim, and J. S. Dong, “Intelex: A llm-driven attack-level threat intelligence extraction framework,” 2024. [Online]. Available: http://arxiv.org/abs/2412.10872

work page arXiv 2024
[7]

Rulepilot: An llm-powered agent for security rule generation,

H. Wang, M. Xu, Y. Guo, W. Han, H. W. Lim, and J. S. Dong, “Rulepilot: An llm-powered agent for security rule generation,” in Proceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE ’26), 2026. [Online]. Available: https://doi.org/10.1145/3744916.3773249

work page doi:10.1145/3744916.3773249 2026
[8]

Neural generation of regular expressions from natural language with minimal domain knowledge,

N. Locascio, K. Narasimhan, E. DeLeon, N. Kushman, and R. Barzilay, “Neural generation of regular expressions from natural language with minimal domain knowledge,” 2016. [Online]. Available: https://arxiv.org/abs/1608.03000

work page arXiv 2016
[9]

Multi-modal synthesis of regular expressions,

Q. Chen, X. Wang, X. Ye, G. Durrett, and I. Dillig, “Multi-modal synthesis of regular expressions,” 2020. [Online]. Available: https://arxiv.org/abs/1908.03316

work page arXiv 2020
[10]

Infere: Step-by-step regex generation via chain of inference,

S. Zhang, X. Gu, Y. Chen, and B. Shen, “Infere: Step-by-step regex generation via chain of inference,” 2023. [Online]. Available: https://arxiv.org/abs/2308.04041

work page arXiv 2023
[11]

Enhancing multi-modal regular expression synthesis via large language models and semantic manipulations of sub-expressions,

Z. Tang, Y. Yan, R. Li, H. Dong, H. Chen, and H. Gao, “Enhancing multi-modal regular expression synthesis via large language models and semantic manipulations of sub-expressions,” in SETTA, 2024, pp. 122–141. [Online]. Available: https://doi.org/10.1007/978-981-96-0602-3_7

work page doi:10.1007/978-981-9 2024
[12]

Mitre att&ck evaluation

M. Corporation, “Mitre att&ck evaluation.” [Online]. Available: https://attackevals.mitre-engenuity.org/
[13]

Tracking the activities of teamtnt

D. Fiser and A. Oliveira, “Tracking the activities of teamtnt.” [Online]. Available: https://documents.trendmicro.com/assets/white_papers/wp-tracking-the-activities-of-teamTNT.pdf
[14]

Unveiling earth kapre aka redcurl’s cyberespionage tactics with trend micro mdr, threat intelligence

M. F. Buddy Tancio, Maria Emreen Viray, “Unveiling earth kapre aka redcurl’s cyberespionage tactics with trend micro mdr, threat intelligence.” [Online]. Available: https://www.trendmicro.com/en_us/research/24/c/unveiling-earth-kapre-aka-redcurls-cyberespionage-tactics-with-t.html
[15]

Triton attribution: Russian government-owned lab most likely built custom intrusion tools for triton attackers

F. Intelligence, “Triton attribution: Russian government-owned lab most likely built custom intrusion tools for triton attackers.” [Online]. Available: https://cloud.google.com/blog/topics/threat-intelligence/triton-attribution-russian-government-owned-lab-most-likely-built-tools/
[16]

Redcurl hackers return to spy on ’major russian bank,’ australian company

D. Antoniuk, “Redcurl hackers return to spy on ’major russian bank,’ australian company.” [Online]. Available: https://therecord.media/redcurl-hackers-russian-bank-australian-company
[17]

Astaroth malware uses legitimate os and antivirus processes to steal passwords and personal data

E. Salem, “Astaroth malware uses legitimate os and antivirus processes to steal passwords and personal data.” [Online]. Available: https://www.cybereason.com/blog/information-stealing-malware-targeting-brazil-full-research
[18]

Mitre att&ck

Mitre, “Mitre att&ck.” [Online]. Available: https://attack.mitre.org/
[19]

Trend micro threat encyclopedia

T. Micro, “Trend micro threat encyclopedia.” [Online]. Available: https://www.trendmicro.com/vinfo/us/threat-encyclopedia#
[20]

What is the pyramid of pain

D. J. Bianco, “What is the pyramid of pain.” [Online]. Available: https://www.attackiq.com/glossary/pyramid-of-pain/
[21]

Enabling Efficient Cyber Threat Hunting With Cyber Threat Intelligence,

P. Gao, F. Shao, X. Liu, X. Xiao, Z. Qin, F. Xu, P. Mittal, S. R. Kulkarni, and D. Song, “Enabling Efficient Cyber Threat Hunting With Cyber Threat Intelligence,” Tech. Rep
[22]

Llm-tikg: Threat intelligence knowledge graph construction utilizing large language model,

Y. Hu, F. Zou, J. Han, X. Sun, and Y. Wang, “Llm-tikg: Threat intelligence knowledge graph construction utilizing large language model,” Computers & Security, vol. 145, p. 103999, 2024

2024
[23]

Ctikg: Llm-powered knowledge graph construction from cyber threat intelligence,

L. Huang and X. Xiao, “Ctikg: Llm-powered knowledge graph construction from cyber threat intelligence,” in First Conference on Language Modeling, 2024

2024
[24]

Constructing knowledge graph from cyber threat intelligence using large language model,

J. Liu and J. Zhan, “Constructing knowledge graph from cyber threat intelligence using large language model,” in 2023 IEEE International Conference on Big Data (BigData). IEEE, 2023, pp. 516–521

2023
[25]

Towards effective identification of attack techniques in cyber threat intelligence reports using large language models,

H. Cuong Nguyen, S. Tariq, M. Baruwal Chhetri, and B. Quoc Vo, “Towards effective identification of attack techniques in cyber threat intelligence reports using large language models,” in Companion Proceedings of the ACM on Web Conference 2025, 2025, pp. 942–946

2025
[26]

Actionable cyber threat intelligence using knowledge graphs and large language models,

R. Fieblinger, M. T. Alam, and N. Rastogi, “Actionable cyber threat intelligence using knowledge graphs and large language models,” in 2024 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). IEEE, 2024, pp. 100–111

2024
[27]

Search reference-rex

Splunk, “Search reference-rex.” [Online]. Available: https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/rex
[28]

Regular expression syntax

Elastic, “Regular expression syntax.” [Online]. Available: https://www.elastic.co/docs/reference/query-languages/query-dsl/regexp-syntax
[29]

Common regular expressions

IBM, “Common regular expressions.” [Online]. Available: https://www.ibm.com/docs/en/dsm?topic=qradar-common-regular-expressions
[30]

Splunk security content

S. community, “Splunk security content.” [Online]. Available: https://github.com/rapdev-io/Threat_Detection_Ruleset-SPLUNK?tab=readme-ov-file
[31]

What is fileless malware

Fortinet, “What is fileless malware.” [Online]. Available: https://www.fortinet.com/resources/cyberglossary/fileless-malware
[32]

regex101

F. Dib, “regex101.” [Online]. Available: https://regex101.com/
[33]

ReAct: Synergizing Reasoning and Acting in Language Models

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” 2023. [Online]. Available: https://arxiv.org/abs/2210.03629

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Regex+: Synthesizing regular expressions from positive examples,

E. Pertseva, M. Barbone, J. Rudek, and N. Polikarpova, “Regex+: Synthesizing regular expressions from positive examples,” 11th Workshop on Synthesis. [Online]. Available: https://par.nsf.gov/biblio/10336574

work page arXiv
[36]

Transregex: Multi-modal regular expression synthesis by generate-and-repair,

Y. Li, S. Li, Z. Xu, J. Cao, Z. Chen, Y. Hu, H. Chen, and S.-C. Cheung, “Transregex: Multi-modal regular expression synthesis by generate-and-repair,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 2021, pp. 1210–1222

2021
[37]

Self-consistency improves chain of thought reasoning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=1PL1NIMMrw

2023
[38]

Actionable cyber threat intelligence using knowledge graphs and large language models,

R. Fieblinger, M. T. Alam, and N. Rastogi, “Actionable cyber threat intelligence using knowledge graphs and large language models,”
[39]

Available: https://arxiv.org/abs/2407.02528

[Online]. Available: https://arxiv.org/abs/2407.02528

work page arXiv
[40]

Microsoft security incident prediction,

S. Freitas, J. Kalajdjieski, A. Gharib, and R. McCann, “Microsoft security incident prediction,” 2024. [Online]. Available: https://www.kaggle.com/dsv/8929038

work page arXiv 2024
[41]

Loghub: A large collection of system log datasets towards automated log analytics,

J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, “Loghub: A large collection of system log datasets for ai-driven log analytics,” 2023. [Online]. Available: https://arxiv.org/abs/2008.06448

work page arXiv 2023
[42]

F. P. Miller, A. F. Vandome, and J. McBrewster, Levenshtein Distance: Information theory, Computer science, String (computer science), String metric, Damerau–Levenshtein distance, Spell checker, Hamming distance. Alpha Press, 2009

2009
[43]

Playing regex golf with genetic programming,

A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao, “Playing regex golf with genetic programming,” in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, ser. GECCO ’14. New York, NY, USA: Association for Computing Machinery, 2014, p. 1063–1070. [Online]. Available: https://doi.org/10.1145/2576768.2598333

work page doi:10.1145/2576768.2598333 2014
[44]

Regex-based entity extraction with active learning and genetic programming,

A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao, “Regex-based entity extraction with active learning and genetic programming,” SIGAPP Appl. Comput. Rev., vol. 16, no. 2, p. 7–15, Aug. 2016. [Online]. Available: https://doi.org/10.1145/2993231.2993232

work page doi:10.1145/2993231.29 2016
[45]

A regular expression generator based on css selectors for efficient extraction from html pages,

E. Uzun, “A regular expression generator based on css selectors for efficient extraction from html pages,” Turkish Journal of Electrical Engineering and Computer Sciences, vol. 28, no. 6, pp. 3389–3401, 2020

2020
[46]

Sketch-driven regular expression generation from natural language and examples,

X. Ye, Q. Chen, X. Wang, I. Dillig, and G. Durrett, “Sketch-driven regular expression generation from natural language and examples,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 679–694, 2020

2020
[47]

Understanding regular expression denial of service (redos): Insights from llm-generated regexes and developer forums,

M. L. Siddiq, J. Zhang, and J. C. D. S. Santos, “Understanding regular expression denial of service (redos): Insights from llm-generated regexes and developer forums,” in Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension, ser. ICPC ’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 190–201. [Online]. Av...

work page arXiv 2024
[48]

From examples to patterns: Llm-generated regular expressions for entity extraction in czech clinical texts

P. Zelina, “From examples to patterns: Llm-generated regular expressions for entity extraction in czech clinical texts.” [Online]. Available: http://nlp.fi.muni.cz/raslan/2024/paper6.pdf

2024