arxiv: 2603.23459 · v2 · submitted 2026-03-24 · 💻 cs.CR · cs.LG

CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Abdul Rahman

Pith reviewed 2026-05-15 00:24 UTC · model grok-4.3

keywords cybersecurity telemetry · data harmonization · canonical substrate · AI-native detection · entity relations · provenance · cyber data normalization · portable analytics

The pith

The Canonical Security Telemetry Substrate (CSTS) unifies heterogeneous cybersecurity data into a common structure over entities, relations, events, state, and provenance for portable AI analytics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cybersecurity data arrives in incompatible formats from many vendors, so analytics teams spend most of their time on ingestion and normalization instead of detection. The paper presents CSTS as a reusable substrate that converts this fragmented input into one representation built on persistent entities, typed relations, events, temporal state, and provenance. Explicit mappings and extensible metadata keep source-specific details intact so that downstream models do not lose necessary context. The design is meant to work across on-prem, hybrid, and multi-cloud deployments and to support anomaly detection, graph learning, forecasting, and agentic AI without repeated custom engineering.

Core claim

CSTS is a canonical, AI-ready telemetry foundation designed to harmonize heterogeneous cyber data into a common representation over persistent entities, typed relations, events, temporal state, and provenance, while preserving source-specific nuance through explicit mappings and extensible metadata so that the same models can run across environments.

What carries the argument

The CSTS representational model of persistent entities, typed relations, events, temporal state, and provenance, with explicit mappings and extensible metadata that retain source detail for downstream inference.
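
As an illustration only, the five components named above could be sketched as plain Python data classes. The paper as summarized here gives no formal schema, so every class name, field name, and the `holds_at` helper below are assumptions made for this sketch, not the author's definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    """Where a canonical record came from (all fields hypothetical)."""
    source: str    # label of the originating feed, e.g. "vendor_a_flow"
    raw_id: str    # identifier of the original vendor record
    mapping: str   # name or version of the mapping rule that was applied


@dataclass
class Entity:
    """A persistent object such as a host, user, or process."""
    entity_id: str
    kind: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relation:
    """A typed edge between two entities."""
    rel_type: str
    src: str
    dst: str
    provenance: Provenance
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Event:
    """A timestamped occurrence involving one or more entities."""
    event_type: str
    timestamp: float
    actors: Tuple[str, ...]
    provenance: Provenance
    # Source-specific fields with no canonical slot are kept here.
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StateInterval:
    """Temporal state: an attribute value valid over a half-open interval."""
    entity_id: str
    attribute: str
    value: Any
    valid_from: float
    valid_to: Optional[float] = None  # None means still current

    def holds_at(self, t: float) -> bool:
        return self.valid_from <= t and (self.valid_to is None or t < self.valid_to)
```

Keeping `metadata` as an open dictionary on each record is one simple way to read "extensible metadata"; a real design might instead use namespaced or typed extensions.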

If this is right

The same AI models for anomaly detection, graph learning, forecasting, behavior modeling, and agentic response can run on data from any mapped source.
Analytics programs no longer need separate ingestion pipelines for each vendor or deployment environment.
A single substrate supports both on-prem and multi-cloud operation without re-engineering the data layer.
Downstream tasks become model-agnostic because the input representation is fixed and portable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Vendors might begin emitting data directly in CSTS-compatible form to reduce customer integration work.
A shared substrate could enable cross-organization sharing of detection models without exposing raw logs.
Empirical tests could measure whether models trained on CSTS-mapped data retain detection accuracy compared with native formats.

Load-bearing premise

Heterogeneous cyber data from arbitrary vendors can be mapped into the CSTS structure while keeping every detail required for accurate AI inference without unacceptable loss or added complexity.

What would settle it

A concrete example of a vendor log or alert whose critical attributes cannot be expressed in CSTS without omitting information that changes the outcome of an anomaly-detection or graph-learning model.
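
A rough way to hunt for such a counterexample is to check which raw field values survive a mapping at all. The sketch below is a crude proxy under stated assumptions: records are nested dictionaries, the helper names `flatten` and `lost_fields` are invented here, and a field counts as "kept" if its value appears anywhere in the mapped record, which can miss renamed-but-altered values or produce coincidental matches.

```python
from typing import Any, Dict, Set


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted key paths."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def lost_fields(raw: Dict[str, Any], mapped: Dict[str, Any]) -> Set[str]:
    """Return raw field paths whose values appear nowhere in the mapped record."""
    mapped_values = {repr(v) for v in flatten(mapped).values()}
    return {
        path
        for path, value in flatten(raw).items()
        if repr(value) not in mapped_values
    }


# Hypothetical flow record and a hypothetical mapping that drops tcp_flags.
raw = {"src": "10.0.0.5", "dst": "10.0.0.9", "bytes": 4096, "tcp_flags": "SYN"}
mapped = {
    "event_type": "network_flow",
    "actors": {"src": "10.0.0.5", "dst": "10.0.0.9"},
    "metadata": {"bytes": 4096},
}
print(lost_fields(raw, mapped))  # {'tcp_flags'}
```

A dropped field is only a candidate: the claim is refuted only if removing it changes the output of a downstream model, which this check does not measure.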

Original abstract

Cybersecurity data remains fragmented across vendors, formats, schemas, and deployment environments, forcing AI and analytics programs to spend disproportionate effort on ingestion, normalization, and brittle source-specific engineering. This paper introduces the Canonical Security Telemetry Substrate (CSTS), a canonical, AI-ready telemetry foundation designed to harmonize heterogeneous cyber data into a common representation over persistent entities, typed relations, events, temporal state, and provenance. CSTS is intended to move cybersecurity analytics beyond ad hoc record normalization toward a reusable substrate that supports anomaly detection, graph learning, forecasting, behavior-based modeling, and agentic cyber AI. We formalize the core design principles of CSTS, define its representational components, and explain how it preserves source-specific nuance through explicit mappings and extensible metadata while still enabling portable downstream inference. We further position CSTS as a cloud-agnostic and deployment-agnostic substrate suitable for on-prem, hybrid, and multi-cloud environments. The result is a unifying telemetry model that reduces the blue-collar burden of cyber data engineering and creates a clearer path to scalable, interoperable, and model-agnostic cyber AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CSTS is a conceptual framing for a unified cyber telemetry substrate that names a real data problem but supplies no mappings, examples, or tests to show it works.

Colleague, the paper's main contribution is laying out CSTS as a single substrate built on persistent entities, typed relations, events, temporal state, and provenance to reduce the custom engineering that currently dominates cyber AI pipelines. It positions this as cloud-agnostic and usable for anomaly detection, graph learning, and agentic systems. That framing is reasonably clear and directly targets a practical bottleneck in the field. The design principles for keeping source nuance through explicit mappings and extensible metadata are stated without obvious internal contradictions.

What is missing is any concrete demonstration. There are no worked mappings from real vendor logs or alerts, no measure of information loss or added complexity, and no downstream evaluation showing that models actually need less source-specific work or perform better on the new structure. The central promise therefore rests on an untested assumption that the five components can absorb heterogeneous data without reintroducing the very engineering overhead the substrate is meant to eliminate.

This paper is aimed at researchers and engineers who care about data standards and infrastructure for cybersecurity ML. A reader looking for new methods or validated results will find little to use directly, but someone organizing a workshop on telemetry models might find the component breakdown useful as a starting point for discussion. I would send it to peer review on the condition that the authors add at least one or two detailed mapping examples and a small-scale validation in revision; without that it stays too preliminary for a full paper.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Canonical Security Telemetry Substrate (CSTS), a conceptual design for a canonical, AI-ready telemetry foundation that harmonizes heterogeneous cybersecurity data into a common representation built on persistent entities, typed relations, events, temporal state, and provenance. It formalizes design principles, defines representational components, explains preservation of source-specific nuance via explicit mappings and extensible metadata, and positions CSTS as cloud- and deployment-agnostic to support anomaly detection, graph learning, forecasting, and agentic cyber AI while reducing data engineering effort.

Significance. If the proposed mappings can be shown to preserve necessary nuance with acceptable complexity, CSTS could meaningfully address data fragmentation in cybersecurity and enable more portable, model-agnostic AI applications. The conceptual framing targets a genuine practical bottleneck, but the absence of any concrete mappings, loss quantification, or validation means the significance is prospective rather than demonstrated.

major comments (2)

[Abstract] Abstract and introduction: the central claim that heterogeneous vendor data can be mapped to CSTS components while preserving all necessary source-specific nuance for downstream AI inference (anomaly detection, graph learning, forecasting) rests entirely on stated design principles with no worked examples, no explicit mapping definitions, and no analysis of information loss or added complexity.
[Representational components] Representational components section: the definitions of persistent entities, typed relations, events, temporal state, and provenance are presented at a high level without formal syntax, axioms, or completeness arguments, leaving open whether the substrate is sufficiently expressive for arbitrary cyber telemetry without reintroducing source-specific engineering downstream.

minor comments (2)

[Abstract] The abstract could include a short statement on the intended scope of the formalization and any acknowledged limitations of the conceptual approach.
[Introduction] Terminology such as 'blue-collar burden' is informal for a journal submission and could be replaced with more precise language about engineering overhead.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the presentation while preserving the conceptual focus of the work.

Referee: [Abstract] Abstract and introduction: the central claim that heterogeneous vendor data can be mapped to CSTS components while preserving all necessary source-specific nuance for downstream AI inference (anomaly detection, graph learning, forecasting) rests entirely on stated design principles with no worked examples, no explicit mapping definitions, and no analysis of information loss or added complexity.

Authors: We agree that the manuscript would be strengthened by concrete illustrations of the mapping process. Although the paper is primarily a conceptual contribution focused on design principles, we will add a new subsection with worked examples mapping representative heterogeneous sources (e.g., network flow records, endpoint telemetry, and SIEM alerts) to CSTS persistent entities, typed relations, events, and provenance. The revision will include explicit mapping rules, discussion of how source-specific nuance is retained via extensible metadata, and a qualitative analysis of information loss and added complexity. This addresses the concern directly. revision: yes
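
To make the kind of worked example promised above concrete, here is a minimal sketch of one mapping rule for a network flow record. It is not taken from the paper: the raw field names (`ts`, `src_ip`, `dst_ip`, `record_id`), the rule name `flow_v0`, and the output layout are all hypothetical. Keying host identity on an IP address is also a simplification, since persistent entities would need real identity resolution.

```python
from typing import Any, Dict

# Raw fields this hypothetical rule maps to canonical slots. Anything else
# is carried along as extensible metadata instead of being dropped.
CANONICAL_FIELDS = {"ts", "src_ip", "dst_ip", "record_id"}


def map_flow(raw: Dict[str, Any], source: str) -> Dict[str, Any]:
    """Map one flow record to entities, a typed relation, and an event."""
    src = {"entity_id": f"host:{raw['src_ip']}", "kind": "host"}
    dst = {"entity_id": f"host:{raw['dst_ip']}", "kind": "host"}
    provenance = {"source": source, "raw_id": raw["record_id"], "mapping": "flow_v0"}
    extras = {k: v for k, v in raw.items() if k not in CANONICAL_FIELDS}
    return {
        "entities": [src, dst],
        "relation": {
            "rel_type": "connects_to",
            "src": src["entity_id"],
            "dst": dst["entity_id"],
            "provenance": provenance,
        },
        "event": {
            "event_type": "network_flow",
            "timestamp": raw["ts"],
            "provenance": provenance,
            "metadata": extras,
        },
    }


record = {
    "record_id": "f-001",
    "ts": 1700000000.0,
    "src_ip": "10.0.0.5",
    "dst_ip": "10.0.0.9",
    "tcp_flags": "SYN",   # vendor-specific detail with no canonical slot
}
canonical = map_flow(record, source="vendor_a_flow")
print(canonical["event"]["metadata"])  # {'tcp_flags': 'SYN'}
```

The pass-through of unmapped fields is what keeps the rule lossless here; whether downstream models can use such untyped metadata without source-specific handling is exactly the open question the referee raises.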
Referee: [Representational components] Representational components section: the definitions of persistent entities, typed relations, events, temporal state, and provenance are presented at a high level without formal syntax, axioms, or completeness arguments, leaving open whether the substrate is sufficiently expressive for arbitrary cyber telemetry without reintroducing source-specific engineering downstream.

Authors: The high-level presentation was chosen to emphasize reusability across deployments. We will revise the section to include a lightweight formal syntax (using tuple-based notation for entities, relations, and events with type constraints), a small set of consistency axioms (e.g., temporal ordering and provenance chaining), and a completeness argument demonstrating coverage of standard cyber telemetry categories. The revision will also clarify that the explicit mapping layer and metadata extensibility are designed to avoid reintroducing source-specific engineering in downstream AI tasks. revision: yes
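
The two example axioms mentioned in this response could be checked mechanically once records are tuples. The sketch below is one possible reading, not the authors' formalization: it assumes "temporal ordering" means a derived record is never timestamped before the record it derives from, and "provenance chaining" means every chain of `derived_from` links ends at a raw record. The `Event` tuple and both checker functions are invented for illustration.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    event_id: str
    timestamp: float
    derived_from: Optional[str]  # parent record id, or None for a raw record


def check_temporal_order(events: List[Event]) -> List[str]:
    """Ids of events timestamped earlier than the event they derive from."""
    by_id = {e.event_id: e for e in events}
    return [
        e.event_id
        for e in events
        if e.derived_from in by_id and e.timestamp < by_id[e.derived_from].timestamp
    ]


def check_provenance_chain(events: List[Event]) -> List[str]:
    """Ids of events whose chain hits a missing parent or a cycle."""
    by_id = {e.event_id: e for e in events}
    broken = []
    for e in events:
        seen = set()
        cur = e
        while cur.derived_from is not None:
            if cur.event_id in seen or cur.derived_from not in by_id:
                broken.append(e.event_id)
                break
            seen.add(cur.event_id)
            cur = by_id[cur.derived_from]
    return broken


events = [
    Event("r1", 10.0, None),       # raw record
    Event("d1", 12.0, "r1"),       # valid derived record
    Event("d2", 5.0, "r1"),        # violates temporal ordering
    Event("d3", 20.0, "missing"),  # dangling provenance link
]
print(check_temporal_order(events))    # ['d2']
print(check_provenance_chain(events))  # ['d3']
```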

Circularity Check

0 steps flagged

No significant circularity: definitional framework without derivations or self-referential reductions

The paper introduces CSTS by defining its representational components (persistent entities, typed relations, events, temporal state, provenance) and design principles for harmonizing heterogeneous data. No equations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked as load-bearing justifications for uniqueness or ansatzes. The central claim is a proposal for a canonical substrate that preserves nuance via explicit mappings and extensible metadata; this does not reduce to its own inputs by construction, as the framework is presented as an organizing definition rather than a derived result from prior fitted data or author-specific theorems. The lack of concrete mappings is a validation gap but does not create circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on domain assumptions about data harmonization feasibility rather than new fitted parameters or invented physical entities.

axioms (1)

domain assumption: Heterogeneous cyber data sources can be mapped to a common set of entities, relations, events, temporal states, and provenance without unacceptable loss of information.
Invoked throughout the abstract as the basis for the canonical representation.

invented entities (1)

CSTS substrate · no independent evidence
purpose: Unified canonical representation for cyber telemetry
Newly defined model introduced in the paper; no independent evidence provided.

pith-pipeline@v0.9.0 · 5488 in / 1183 out tokens · 31782 ms · 2026-05-15T00:24:26.234636+00:00 · methodology
