
arxiv: 2603.23459 · v2 · submitted 2026-03-24 · 💻 cs.CR · cs.LG

CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Pith reviewed 2026-05-15 00:24 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords: cybersecurity telemetry, data harmonization, canonical substrate, AI-native detection, entity relations, provenance, cyber data normalization, portable analytics

The pith

The Canonical Security Telemetry Substrate (CSTS) unifies heterogeneous cybersecurity data into a common structure over entities, relations, events, state, and provenance for portable AI analytics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cybersecurity data arrives in incompatible formats from many vendors, so analytics teams spend disproportionate effort on ingestion and normalization instead of detection. The paper presents CSTS as a reusable substrate that converts this fragmented input into one representation built on persistent entities, typed relations, events, temporal state, and provenance. Explicit mappings and extensible metadata are meant to keep source-specific details intact so that downstream models do not lose necessary context. The design is intended to work across on-prem, hybrid, and multi-cloud deployments and to support anomaly detection, graph learning, forecasting, and agentic AI without repeated custom engineering.

Core claim

CSTS is a canonical, AI-ready telemetry foundation designed to harmonize heterogeneous cyber data into a common representation over persistent entities, typed relations, events, temporal state, and provenance, while preserving source-specific nuance through explicit mappings and extensible metadata so that the same models can run across environments.

What carries the argument

The CSTS representational model of persistent entities, typed relations, events, temporal state, and provenance, with explicit mappings and extensible metadata that retain source detail for downstream inference.
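To make the five components concrete, here is a minimal Python sketch of how they might be laid out. The paper, as summarized here, gives no schema, so every class name, field name, and the `map_flow` mapping rule below is a hypothetical illustration, not the authors' definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

# NOTE: all names here are hypothetical; the paper defines no concrete schema.

@dataclass(frozen=True)
class Provenance:
    source: str    # vendor or sensor that produced the record
    raw_id: str    # identifier of the original record
    mapping: str   # name/version of the mapping rule that was applied

@dataclass(frozen=True)
class Entity:
    entity_id: str  # persistent identifier, stable across sources and time
    kind: str       # e.g. "host", "user", "process"

@dataclass(frozen=True)
class Relation:
    src: str        # entity_id of the source entity
    rel_type: str   # typed relation, e.g. "connects_to"
    dst: str        # entity_id of the destination entity

@dataclass(frozen=True)
class StateInterval:
    entity_id: str
    attribute: str
    value: Any
    valid_from: float
    valid_to: Optional[float] = None  # None means "still valid"

@dataclass
class Event:
    event_type: str
    timestamp: float
    actors: Tuple[str, ...]
    provenance: Provenance
    # Extensible metadata: source-specific fields the canonical model lacks.
    metadata: Dict[str, Any] = field(default_factory=dict)

def map_flow(raw):
    """Hypothetical explicit mapping for one made-up flow-record format."""
    src = Entity(f"host:{raw['src_ip']}", "host")
    dst = Entity(f"host:{raw['dst_ip']}", "host")
    relation = Relation(src.entity_id, "connects_to", dst.entity_id)
    canonical_keys = {"id", "ts", "src_ip", "dst_ip"}
    event = Event(
        event_type="network_flow",
        timestamp=float(raw["ts"]),
        actors=(src.entity_id, dst.entity_id),
        provenance=Provenance("vendor_a_flow", str(raw["id"]), "flow_v0"),
        # Anything without a canonical slot is carried along, not dropped.
        metadata={k: v for k, v in raw.items() if k not in canonical_keys},
    )
    return [src, dst], relation, event

entities, relation, event = map_flow(
    {"id": 7, "ts": 1700000000, "src_ip": "10.0.0.5",
     "dst_ip": "10.0.0.9", "vendor_flag": "X1"}
)
```

In this sketch the unrecognized `vendor_flag` field ends up in `event.metadata`, which is one plausible reading of "extensible metadata that retains source detail".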

If this is right

  • The same AI models for anomaly detection, graph learning, forecasting, behavior modeling, and agentic response can run on data from any mapped source.
  • Analytics programs no longer need separate ingestion pipelines for each vendor or deployment environment.
  • A single substrate supports both on-prem and multi-cloud operation without re-engineering the data layer.
  • Downstream tasks become model-agnostic because the input representation is fixed and portable.
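The portability claim in these bullets can be illustrated with a small sketch: two differently shaped sources are mapped into one record shape, and a single downstream feature function runs on both. The record layouts, the mapper names, and the `failures_per_actor` feature are all assumptions made up for illustration; none of them come from the paper.

```python
from collections import Counter

# Hypothetical raw records from two differently shaped sources.
VENDOR_A = [
    {"user": "alice", "evt": "login_fail", "epoch": 100},
    {"user": "alice", "evt": "login_fail", "epoch": 130},
]
VENDOR_B = [
    {"actor": {"name": "alice"}, "action": "AUTH_FAILURE", "time_ms": 160000},
]

# Per-source vocabularies translated to one (hypothetical) canonical type.
A_TYPES = {"login_fail": "auth_failure"}
B_TYPES = {"AUTH_FAILURE": "auth_failure"}

def from_vendor_a(r):
    return {
        "actor": f"user:{r['user']}",
        "type": A_TYPES.get(r["evt"], "unknown"),
        "ts": float(r["epoch"]),          # already in seconds
        "source": "vendor_a",
    }

def from_vendor_b(r):
    return {
        "actor": f"user:{r['actor']['name']}",
        "type": B_TYPES.get(r["action"], "unknown"),
        "ts": r["time_ms"] / 1000.0,      # milliseconds to seconds
        "source": "vendor_b",
    }

def failures_per_actor(events, window_start, window_end):
    """Source-agnostic feature: auth failures per actor in [start, end)."""
    return Counter(
        e["actor"]
        for e in events
        if e["type"] == "auth_failure" and window_start <= e["ts"] < window_end
    )

canonical = [from_vendor_a(r) for r in VENDOR_A] + [from_vendor_b(r) for r in VENDOR_B]
counts = failures_per_actor(canonical, 0, 200)
```

The per-vendor work is confined to the two mapper functions; the feature function never inspects the `source` field. Whether real vendor data allows such a clean separation is exactly the open question raised in the premise below.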

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Vendors might begin emitting data directly in CSTS-compatible form to reduce customer integration work.
  • A shared substrate could enable cross-organization sharing of detection models without exposing raw logs.
  • Empirical tests could measure whether models trained on CSTS-mapped data retain detection accuracy compared with native formats.

Load-bearing premise

Heterogeneous cyber data from arbitrary vendors can be mapped into the CSTS structure while keeping every detail required for accurate AI inference without unacceptable loss or added complexity.

What would settle it

A concrete example of a vendor log or alert whose critical attributes cannot be expressed in CSTS without omitting information that changes the outcome of an anomaly-detection or graph-learning model.
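One way such a counterexample might be searched for is a field-level loss check that compares a raw record with its mapped form. The sketch below is a crude heuristic under stated assumptions: the record layouts are invented, and a field counts as "kept" only if its exact value appears somewhere in the mapped record, so unit conversions or renormalized values would be reported as dropped, and coincidentally equal values would be reported as kept.

```python
def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value

def dropped_fields(raw, canonical):
    """Return raw leaf paths whose values appear nowhere in the mapped record."""
    kept = {repr(value) for _, value in flatten(canonical)}
    return sorted(path for path, value in flatten(raw) if repr(value) not in kept)

# Hypothetical raw record and a hypothetical mapped form of it.
raw = {
    "src": "10.0.0.5",
    "dst": "10.0.0.9",
    "tcp": {"flags": "SYN", "window": 1024},
}
mapped = {
    "src_entity": "10.0.0.5",
    "dst_entity": "10.0.0.9",
    "metadata": {"tcp.flags": "SYN"},
}

lost = dropped_fields(raw, mapped)
```

Here `lost` contains only `tcp.window`. A dropped field settles the question only if removing it also changes a downstream model's output, which this check does not measure.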

Original abstract

Cybersecurity data remains fragmented across vendors, formats, schemas, and deployment environments, forcing AI and analytics programs to spend disproportionate effort on ingestion, normalization, and brittle source-specific engineering. This paper introduces the Canonical Security Telemetry Substrate (CSTS), a canonical, AI-ready telemetry foundation designed to harmonize heterogeneous cyber data into a common representation over persistent entities, typed relations, events, temporal state, and provenance. CSTS is intended to move cybersecurity analytics beyond ad hoc record normalization toward a reusable substrate that supports anomaly detection, graph learning, forecasting, behavior-based modeling, and agentic cyber AI. We formalize the core design principles of CSTS, define its representational components, and explain how it preserves source-specific nuance through explicit mappings and extensible metadata while still enabling portable downstream inference. We further position CSTS as a cloud-agnostic and deployment-agnostic substrate suitable for on-prem, hybrid, and multi-cloud environments. The result is a unifying telemetry model that reduces the blue-collar burden of cyber data engineering and creates a clearer path to scalable, interoperable, and model-agnostic cyber AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Canonical Security Telemetry Substrate (CSTS), a conceptual design for a canonical, AI-ready telemetry foundation that harmonizes heterogeneous cybersecurity data into a common representation built on persistent entities, typed relations, events, temporal state, and provenance. It formalizes design principles, defines representational components, explains preservation of source-specific nuance via explicit mappings and extensible metadata, and positions CSTS as cloud- and deployment-agnostic to support anomaly detection, graph learning, forecasting, and agentic cyber AI while reducing data engineering effort.

Significance. If the proposed mappings can be shown to preserve necessary nuance with acceptable complexity, CSTS could meaningfully address data fragmentation in cybersecurity and enable more portable, model-agnostic AI applications. The conceptual framing targets a genuine practical bottleneck, but the absence of any concrete mappings, loss quantification, or validation means the significance is prospective rather than demonstrated.

major comments (2)
  1. [Abstract] Abstract and introduction: the central claim that heterogeneous vendor data can be mapped to CSTS components while preserving all necessary source-specific nuance for downstream AI inference (anomaly detection, graph learning, forecasting) rests entirely on stated design principles with no worked examples, no explicit mapping definitions, and no analysis of information loss or added complexity.
  2. [Representational components] Representational components section: the definitions of persistent entities, typed relations, events, temporal state, and provenance are presented at a high level without formal syntax, axioms, or completeness arguments, leaving open whether the substrate is sufficiently expressive for arbitrary cyber telemetry without reintroducing source-specific engineering downstream.
minor comments (2)
  1. [Abstract] The abstract could include a short statement on the intended scope of the formalization and any acknowledged limitations of the conceptual approach.
  2. [Introduction] Terminology such as 'blue-collar burden' is informal for a journal submission and could be replaced with more precise language about engineering overhead.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the presentation while preserving the conceptual focus of the work.

Point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the central claim that heterogeneous vendor data can be mapped to CSTS components while preserving all necessary source-specific nuance for downstream AI inference (anomaly detection, graph learning, forecasting) rests entirely on stated design principles with no worked examples, no explicit mapping definitions, and no analysis of information loss or added complexity.

    Authors: We agree that the manuscript would be strengthened by concrete illustrations of the mapping process. Although the paper is primarily a conceptual contribution focused on design principles, we will add a new subsection with worked examples mapping representative heterogeneous sources (e.g., network flow records, endpoint telemetry, and SIEM alerts) to CSTS persistent entities, typed relations, events, and provenance. The revision will include explicit mapping rules, discussion of how source-specific nuance is retained via extensible metadata, and a qualitative analysis of information loss and added complexity. This addresses the concern directly.
    Revision: yes

  2. Referee: [Representational components] Representational components section: the definitions of persistent entities, typed relations, events, temporal state, and provenance are presented at a high level without formal syntax, axioms, or completeness arguments, leaving open whether the substrate is sufficiently expressive for arbitrary cyber telemetry without reintroducing source-specific engineering downstream.

    Authors: The high-level presentation was chosen to emphasize reusability across deployments. We will revise the section to include a lightweight formal syntax (using tuple-based notation for entities, relations, and events with type constraints), a small set of consistency axioms (e.g., temporal ordering and provenance chaining), and a completeness argument demonstrating coverage of standard cyber telemetry categories. The revision will also clarify that the explicit mapping layer and metadata extensibility are designed to avoid reintroducing source-specific engineering in downstream AI tasks.
    Revision: yes
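The two consistency axioms named in the rebuttal, temporal ordering and provenance chaining, are not spelled out anywhere in the text. As a rough sketch of what checks of that kind could look like, the Python below assumes a hypothetical dict layout (`entity`, `attr`, `start`, `end` for state intervals; `id`, `derived_from` for records) and one possible reading of each axiom.

```python
def check_state_ordering(intervals):
    """Assumed reading of 'temporal ordering': each interval has start <= end,
    and intervals for the same (entity, attribute) do not overlap."""
    errors = []
    by_key = {}
    for iv in intervals:
        if iv["end"] is not None and iv["end"] < iv["start"]:
            errors.append(f"inverted interval: {iv}")
        by_key.setdefault((iv["entity"], iv["attr"]), []).append(iv)
    for key, ivs in by_key.items():
        ivs.sort(key=lambda iv: iv["start"])
        for prev, cur in zip(ivs, ivs[1:]):
            # An open-ended interval (end is None) overlaps anything after it.
            if prev["end"] is None or prev["end"] > cur["start"]:
                errors.append(f"overlap for {key} at {cur['start']}")
    return errors

def check_provenance_chain(records):
    """Assumed reading of 'provenance chaining': following 'derived_from' from
    any record reaches a raw record (derived_from is None) without hitting a
    missing parent or a cycle."""
    by_id = {r["id"]: r for r in records}
    errors = []
    for r in records:
        seen, cur = set(), r
        while cur["derived_from"] is not None:
            if cur["id"] in seen:
                errors.append(f"cycle at {r['id']}")
                break
            seen.add(cur["id"])
            parent = by_id.get(cur["derived_from"])
            if parent is None:
                errors.append(f"missing parent for {cur['id']}")
                break
            cur = parent
    return errors

# Hypothetical data: the third interval overlaps the first.
intervals = [
    {"entity": "host:1", "attr": "ip", "start": 0, "end": 10},
    {"entity": "host:1", "attr": "ip", "start": 10, "end": None},
    {"entity": "host:1", "attr": "ip", "start": 5, "end": 8},
]
# Hypothetical data: evt2 points at a parent that does not exist.
records = [
    {"id": "raw1", "derived_from": None},
    {"id": "evt1", "derived_from": "raw1"},
    {"id": "evt2", "derived_from": "rawX"},
]

ordering_errors = check_state_ordering(intervals)
chain_errors = check_provenance_chain(records)
```

Checks like these could run at ingestion time, so that a faulty mapping is caught before it reaches a model. The actual axioms in a revised manuscript may differ.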

Circularity Check

0 steps flagged

No significant circularity: definitional framework without derivations or self-referential reductions

Full rationale

The paper introduces CSTS by defining its representational components (persistent entities, typed relations, events, temporal state, provenance) and design principles for harmonizing heterogeneous data. No equations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked as load-bearing justifications for uniqueness or ansatzes. The central claim is a proposal for a canonical substrate that preserves nuance via explicit mappings and extensible metadata; this does not reduce to its own inputs by construction, as the framework is presented as an organizing definition rather than a derived result from prior fitted data or author-specific theorems. The lack of concrete mappings is a validation gap but does not create circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on domain assumptions about data harmonization feasibility rather than new fitted parameters or invented physical entities.

axioms (1)
  • Domain assumption: Heterogeneous cyber data sources can be mapped to a common set of entities, relations, events, temporal states, and provenance without unacceptable loss of information.
    Invoked throughout the abstract as the basis for the canonical representation.
invented entities (1)
  • CSTS substrate (no independent evidence)
    purpose: Unified canonical representation for cyber telemetry
    Newly defined model introduced in the paper; no independent evidence provided.

pith-pipeline@v0.9.0 · 5488 in / 1183 out tokens · 31782 ms · 2026-05-15T00:24:26.234636+00:00 · methodology

