arxiv: 2606.21389 · v1 · pith:DIM2ORJGnew · submitted 2026-06-19 · 💻 cs.CR

From Production SIEM to Reusable Cybersecurity Artifacts

Pith reviewed 2026-06-26 13:53 UTC · model grok-4.3

classification 💻 cs.CR
keywords SIEM · anonymization · cybersecurity artifacts · privacy-utility boundary · SOC data · MITRE ATT&CK · LLM evaluation · incident response

The pith

A methodology extracts, anonymizes, structures, and validates production SIEM data from a financial SOC to yield reusable artifacts that preserve task-relevant investigative structure inside a declared privacy boundary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the gap between inaccessible production telemetry and shareable research data as a design problem rather than an insurmountable barrier. It defines a four-step process that pulls logs from a live financial security operations center, applies anonymization, structures the output, and validates it against real tasks. Two concrete tests establish a measurable privacy-utility line: 37 MITRE ATT&CK-mapped HIKARI challenges work only when temporal order and entity consistency survive anonymization, and a deterministic verifier run over 200 SOCpilot incidents flags non-compliant LLM actions that never appear in the human baseline. A sympathetic reader would see this as a route to replacing synthetic datasets with controlled real-world evidence.

Core claim

Operational evidence is not automatically scientific evidence. The most realistic Security Operations Center data is production telemetry, yet it remains scientifically inaccessible because raw logs cannot be released; as a result, research relies on synthetic or dated datasets. We treat the boundary between private production telemetry and reusable research artifacts as the design object: a methodology that extracts, anonymizes, structures, and validates SIEM data from a production financial SOC while preserving task-relevant investigative structure within a declared privacy boundary. Two consumers stress the same artifact. As training material, it fails loudly: 37 MITRE ATT&CK-mapped HIKAR

What carries the argument

The extraction-anonymization-structuring-validation pipeline that maintains temporal order and entity consistency inside an explicit privacy boundary.
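
As a minimal sketch of what "entity consistency" and "temporal order" could mean in practice, assume a keyed-hash pseudonymizer and a single constant time shift. The text above does not describe the paper's actual mechanism, so the function names, the record fields (`ts`, `user`, `host`, `action`), and the demo key are all hypothetical.

```python
import hashlib
import hmac

def pseudonymize(value: str, key: bytes, prefix: str = "ent") -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    The same (value, key) pair always yields the same pseudonym, so
    references to one entity stay linked across log sources.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:12]}"

def anonymize_events(events, key: bytes, shift_seconds: int):
    """Pseudonymize 'user' and 'host', and shift 'ts' by one constant offset.

    A constant shift hides absolute times but leaves ordering and
    inter-event gaps untouched (an assumption of this sketch).
    """
    out = []
    for ev in events:
        out.append({
            "ts": ev["ts"] + shift_seconds,
            "user": pseudonymize(ev["user"], key, "user"),
            "host": pseudonymize(ev["host"], key, "host"),
            "action": ev["action"],
        })
    return out

# Hypothetical toy events; real SIEM records carry many more fields.
events = [
    {"ts": 1000, "user": "alice", "host": "srv-01", "action": "login"},
    {"ts": 1005, "user": "alice", "host": "srv-02", "action": "lateral_move"},
    {"ts": 1010, "user": "bob", "host": "srv-02", "action": "login"},
]
anon = anonymize_events(events, key=b"demo-key-not-for-production", shift_seconds=-400)
```

In this sketch the two events by the same user still share one pseudonym, and the shared host still links the second and third events, which is the kind of structure an investigative challenge would need. Truncating the digest to 12 hex characters is a readability choice here and raises the collision probability relative to the full digest.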

If this is right

  • The same anonymized artifacts support both training challenges mapped to MITRE ATT&CK and quantitative measurement of automated incident response.
  • A measurable privacy-utility boundary replaces binary anonymity claims as the evaluation standard for released SOC data.
  • Production telemetry can serve as the substrate for reproducible experiments once temporal and entity structure is preserved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other organizations holding production logs could run equivalent extraction pipelines to enlarge the pool of controlled research artifacts.
  • The dual-use testing approach could be applied to evaluate additional automated tools beyond LLMs on the same incident set.
  • The privacy boundary definition offers a template for sharing data across regulated sectors where full release remains impossible.

Load-bearing premise

The anonymization steps keep temporal order and entity consistency intact enough for the 37 challenges to run correctly and for the verifier to produce meaningful comparisons without the transformations adding hidden biases.

What would settle it

If the 37 HIKARI challenges stop working once anonymization is applied, or if the deterministic verifier no longer detects non-compliant LLM actions on the 200 SOCpilot incidents, the claimed utility of the artifacts would not hold.
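
To make "deterministic verifier" concrete, here is a small sketch under stated assumptions: the policy is modeled as a fixed set of permitted actions per incident category, and a violation is any action outside that set. The policy table, category names, and action names are invented for illustration and are not taken from SOCpilot.

```python
from typing import Dict, List, Set

# Hypothetical policy: actions permitted per incident category.
POLICY: Dict[str, Set[str]] = {
    "phishing": {"quarantine_email", "reset_password", "notify_user"},
    "malware": {"isolate_host", "collect_forensics", "notify_user"},
}

def verify(category: str, actions: List[str]) -> List[str]:
    """Return the actions the policy does not permit, in input order.

    The check is a pure set lookup, so the same input always gives the
    same verdict. Unknown categories permit nothing.
    """
    allowed = POLICY.get(category, set())
    return [a for a in actions if a not in allowed]

# One hypothetical incident handled by a human analyst and by an LLM.
incident = {"id": "inc-001", "category": "phishing"}
human_actions = ["quarantine_email", "notify_user"]
llm_actions = ["quarantine_email", "isolate_host", "notify_user"]

human_violations = verify(incident["category"], human_actions)
llm_violations = verify(incident["category"], llm_actions)
```

Here the human baseline produces no violations while the LLM's `isolate_host` is flagged. A failure of this kind is "quiet" in the sense used above: nothing breaks visibly, and the deviation only shows up when actions are checked against the policy.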

original abstract

Operational evidence is not automatically scientific evidence. The most realistic Security Operations Center (SOC) data is production telemetry, yet it remains scientifically inaccessible because raw logs cannot be released; as a result, research relies on synthetic or dated datasets. We treat the boundary between private production telemetry and reusable research artifacts as the design object: a methodology that extracts, anonymizes, structures, and validates Security Information and Event Management (SIEM) data from a production financial SOC while preserving task-relevant investigative structure within a declared privacy boundary. Two consumers stress the same artifact. As training material, it fails loudly: 37 MITRE ATT&CK-mapped HIKARI challenges work only when anonymization preserves temporal order and entity consistency. As a measurement substrate, it fails quietly: across 200 SOCpilot incidents, a deterministic verifier detects non-compliant Large Language Model (LLM) actions that are absent from the human baseline. The result is a measurable privacy-utility boundary rather than a formal anonymity claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a methodology that extracts, anonymizes, structures, and validates SIEM data from a production financial SOC to produce reusable cybersecurity artifacts that preserve task-relevant investigative structure within a declared privacy boundary. This is demonstrated via two consumers of the same artifact: as training material, 37 MITRE ATT&CK-mapped HIKARI challenges succeed only when anonymization preserves temporal order and entity consistency; as a measurement substrate, a deterministic verifier on 200 SOCpilot incidents detects non-compliant LLM actions absent from the human baseline, yielding a measurable privacy-utility boundary rather than a formal anonymity claim.

Significance. If the central claims hold, the work addresses a key barrier in cybersecurity research by enabling realistic, production-derived datasets for evaluation and training while respecting privacy constraints. The dual-use design (loud failure on challenges, quiet detection on incidents) and concrete scale (37 challenges, 200 incidents) provide a practical, falsifiable demonstration of the privacy-utility trade-off that could influence how future SOC artifacts are shared.

major comments (2)
  1. [Methodology (anonymization description)] The anonymization step of the extract-anonymize-structure-validate pipeline is load-bearing for the central claim that task-relevant structure is preserved, yet the manuscript provides no concrete mechanism for global entity-ID remapping that maintains cross-log references or for timestamp jittering that preserves relative ordering and causality; without these, the HIKARI challenge results and SOCpilot verifier outcomes could be artifacts of the pipeline rather than evidence of utility.
  2. [Evaluation / Results] The evaluation reports 37 HIKARI challenges and 200 SOCpilot incidents as concrete outcomes but supplies no error analysis, baseline comparisons against non-anonymized data, or explicit checks that the transformations did not introduce or destroy correlations; this undermines the assertion of a measurable privacy-utility boundary.
minor comments (1)
  1. [Notation / Definitions] Notation for the privacy boundary and the deterministic verifier should be defined more explicitly (e.g., with a short pseudocode listing or table of invariants) to aid reproducibility.
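
The referee's request for ordering-preserving jitter and explicit invariants can be illustrated with a short sketch. It assumes one possible scheme, scaling each inter-event gap by a bounded random factor, which is not claimed to be the authors' procedure; `jitter_timestamps`, `same_order`, and the `max_fraction` bound are hypothetical names.

```python
import random

def jitter_timestamps(timestamps, seed, max_fraction=0.2):
    """Perturb inter-event gaps while keeping the original ordering.

    Expects timestamps sorted ascending. Each gap is scaled by a factor in
    [1 - max_fraction, 1 + max_fraction]; since the factor is positive,
    positive gaps stay positive and zero gaps stay zero.
    """
    if not 0 <= max_fraction < 1:
        raise ValueError("max_fraction must be in [0, 1)")
    if not timestamps:
        return []
    rng = random.Random(seed)  # seeded so the transformation is reproducible
    out = [float(timestamps[0])]
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap < 0:
            raise ValueError("timestamps must be sorted ascending")
        factor = 1 + rng.uniform(-max_fraction, max_fraction)
        out.append(out[-1] + gap * factor)
    return out

def same_order(original, transformed):
    """Invariant: each adjacent pair keeps its relation (earlier or simultaneous)."""
    if len(original) != len(transformed):
        return False
    pairs_before = zip(original, original[1:])
    pairs_after = zip(transformed, transformed[1:])
    for (a, b), (x, y) in zip(pairs_before, pairs_after):
        if (a < b) != (x < y) or (a == b) != (x == y):
            return False
    return True

ts = [100, 100, 130, 190]          # two simultaneous events, then two later ones
jittered = jitter_timestamps(ts, seed=7)
order_ok = same_order(ts, jittered)
```

For sorted input, checking adjacent pairs is enough to establish that the full ordering is preserved. One caveat of this sketch: with very large timestamps and very small gaps, floating-point rounding could collapse a positive gap, so an integer representation (for example, nanoseconds) may be preferable in practice.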

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. The comments highlight areas where additional detail will strengthen the manuscript, and we address each point below with proposed revisions.

point-by-point responses
  1. Referee: [Methodology (anonymization description)] The anonymization step of the extract-anonymize-structure-validate pipeline is load-bearing for the central claim that task-relevant structure is preserved, yet the manuscript provides no concrete mechanism for global entity-ID remapping that maintains cross-log references or for timestamp jittering that preserves relative ordering and causality; without these, the HIKARI challenge results and SOCpilot verifier outcomes could be artifacts of the pipeline rather than evidence of utility.

    Authors: We agree that the current description of anonymization is insufficiently concrete. In revision we will expand Section 3 with (a) the precise global entity remapping algorithm, including how cross-log references are maintained via consistent pseudonym assignment and collision avoidance, and (b) the timestamp jitter procedure with explicit bounds and ordering guarantees. These additions will make clear that the reported outcomes depend on preserved structure rather than incidental pipeline effects. revision: yes

  2. Referee: [Evaluation / Results] The evaluation reports 37 HIKARI challenges and 200 SOCpilot incidents as concrete outcomes but supplies no error analysis, baseline comparisons against non-anonymized data, or explicit checks that the transformations did not introduce or destroy correlations; this undermines the assertion of a measurable privacy-utility boundary.

    Authors: We will add an error-analysis subsection detailing failure cases for both the HIKARI challenges and the SOCpilot verifier, plus statistical checks (e.g., correlation matrices on non-sensitive fields) confirming that transformations do not materially alter task-relevant distributions. Direct comparison against the original non-anonymized logs is not feasible; we will instead introduce a control condition using deliberately inconsistent entity IDs as a degraded baseline. revision: partial
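
The proposed control condition can be sketched as follows, assuming a simple linkage metric: the number of event pairs that share an entity ID. The metric, the function names, and the toy entity sequence are illustrative assumptions, not the authors' design.

```python
from collections import Counter

def consistent_ids(entities):
    """Assign one pseudonym per distinct entity, in first-seen order."""
    mapping = {}
    out = []
    for e in entities:
        if e not in mapping:
            mapping[e] = f"e{len(mapping)}"
        out.append(mapping[e])
    return out

def inconsistent_ids(entities):
    """Degraded control: a fresh pseudonym for every occurrence."""
    return [f"x{i}" for i, _ in enumerate(entities)]

def linked_pairs(ids):
    """Count event pairs that share an entity ID (n choose 2 per ID)."""
    return sum(n * (n - 1) // 2 for n in Counter(ids).values())

# Hypothetical sequence of the acting entity in six events.
raw = ["alice", "bob", "alice", "alice", "carol", "bob"]

links_raw = linked_pairs(raw)
links_consistent = linked_pairs(consistent_ids(raw))
links_degraded = linked_pairs(inconsistent_ids(raw))
```

In this toy case consistent remapping keeps all four cross-event links, while the degraded baseline keeps none. A challenge that depends on following one entity across events should therefore pass under the first condition and fail under the second, which is the contrast the control is meant to expose.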

standing simulated objections not resolved
  • Direct baseline comparisons against the original non-anonymized production logs, which remain inaccessible due to privacy constraints.

Circularity Check

0 steps flagged

No circularity: claims rest on external validation tasks

full rationale

The paper presents a methodology for extracting and anonymizing SIEM data, validated by success/failure on independent external benchmarks (37 HIKARI challenges requiring preserved order/consistency, and 200 SOCpilot incidents with a deterministic verifier). No equations, self-definitional quantities, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The privacy-utility boundary is demonstrated via these external consumers rather than by construction from the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that production telemetry contains preservable task-relevant structure and that the chosen anonymization steps do not invalidate the two consumer validations; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Anonymization steps can be chosen such that temporal order and entity consistency are retained at a level usable for MITRE ATT&CK-mapped challenges and incident-response verification.
    Invoked implicitly when stating that the 37 HIKARI challenges work only when these properties are preserved.

pith-pipeline@v0.9.1-grok · 5722 in / 1251 out tokens · 21222 ms · 2026-06-26T13:53:26.422834+00:00 · methodology

