CALIBURN: Operationally Calibrated Streaming Intrusion Detection with Regime-Dependent Conformal Risk Control
Pith reviewed 2026-06-30 12:47 UTC · model grok-4.3
The pith
Streaming intrusion detection can set its alerting threshold directly from operator budgets and costs via conformal risk control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The behaviour of calibration and conformal risk control is strongly regime-dependent across attack prevalence. Across three regimes -- LITNET-2020 (5.2%), CICIDS2017 (22%), UNSW-NB15 (64%) -- CALIBURN reaches AUC-PR 0.943 in the rare-attack regime it targets, beating the best streaming baseline by 2.21x and the best batch reference by 4.12x, with isotonic calibration cutting Brier score 30%; it stays strongest among streaming methods at moderate prevalence; and all converge to the prevalence floor under base-rate inversion.
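To make the calibration and Brier-score terms concrete, here is a minimal sketch of isotonic calibration by pool-adjacent-violators together with the Brier score. It is an illustration only, not the paper's code: the function names `isotonic_fit` and `brier` are hypothetical, and the toy data are invented for the example.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit of labels against scores.

    Returns (sorted_scores, sorted_labels, fitted), where `fitted` is the
    non-decreasing sequence of calibrated probabilities, one per sorted point.
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [s for s, _ in pairs], [y for _, y in pairs], fitted


def brier(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Toy data (invented): raw detector scores and attack labels.
raw_scores = [0.1, 0.2, 0.3, 0.4]
attack = [0, 1, 0, 1]
s, y, calibrated = isotonic_fit(raw_scores, attack)
print(brier(s, y), brier(calibrated, y))
```

The comparison here is in-sample and only shows the mechanics. A real evaluation would fit the isotonic map on a held-out calibration split and score it on separate data.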
What carries the argument
The Conformal Risk Control wrapper that maps a pre-specified alert budget alpha to a false-positive-bounded threshold under the exchangeability assumption.
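As an illustration of how such a wrapper can turn a budget into a threshold, the sketch below applies the standard conformal risk control selection rule to calibration scores from benign flows. It is not CALIBURN's implementation: the function name `crc_threshold`, the 0/1 false-alert loss, and the default loss bound of 1 are assumptions made for the example.

```python
import bisect
import math


def crc_threshold(benign_scores, alpha, loss_bound=1.0):
    """Smallest threshold t whose adjusted calibration risk is at most alpha.

    Assumed loss for one benign flow: 1 if score > t (a false alert), else 0.
    The CRC condition (n/(n+1)) * risk(t) + loss_bound/(n+1) <= alpha
    simplifies to (false_alerts + loss_bound) / (n + 1) <= alpha.
    """
    scores = sorted(benign_scores)
    n = len(scores)
    for t in scores:  # candidate thresholds, lowest first
        false_alerts = n - bisect.bisect_right(scores, t)  # scores strictly above t
        if (false_alerts + loss_bound) / (n + 1) <= alpha:
            return t
    # No candidate satisfies the condition: fall back to never alerting.
    return math.inf


calibration = [i / 98 for i in range(99)]  # invented, evenly spaced scores
print(crc_threshold(calibration, alpha=0.105))
```

With 99 calibration scores and alpha = 0.105, the rule tolerates at most 9 false alerts among the calibration flows, so it returns the 90th smallest score. The guarantee behind the rule relies on calibration and future benign flows being exchangeable.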
If this is right
- In rare-attack streams the integrated pipeline maintains high detection performance.
- At moderate prevalence it remains the strongest among streaming methods.
- Under high prevalence, the performance of all methods approaches the base-rate floor.
- The high-prevalence collapse is intrinsic to streaming rather than a dataset artifact.
- CRC overshoot of 2B/(n0+1) and empirical-density degeneracy limit conformal alerting at very small alpha.
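The overshoot term in the last point lends itself to a quick arithmetic check. The sketch below assumes that B is the upper bound on the per-flow loss and n0 is the number of calibration flows; the page does not define either symbol, so both readings are assumptions, as are the function names and the choice to compare the term directly against alpha.

```python
def crc_overshoot(loss_bound, n_calibration):
    """The term 2B/(n0+1), under the assumed meanings of B and n0."""
    return 2.0 * loss_bound / (n_calibration + 1)


def budget_is_resolvable(alpha, loss_bound, n_calibration):
    """Hypothetical check: is the overshoot term smaller than the alert budget?"""
    return crc_overshoot(loss_bound, n_calibration) < alpha


# With 500 calibration flows and B = 1, the term is 2/501, roughly 0.004.
print(crc_overshoot(1.0, 500))
print(budget_is_resolvable(0.05, 1.0, 500))   # budget well above the term
print(budget_is_resolvable(0.001, 1.0, 500))  # budget below the term
```

Under these assumptions a budget of 0.05 clears the check while 0.001 does not, which matches the claim that trouble appears at very small alpha.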
Where Pith is reading between the lines
- The observed regime dependence may appear in other streaming binary classification tasks that use conformal wrappers.
- The two proposed pre-deployment checks for CRC failure could be applied to conformal methods outside security.
- Adjusting the change-point detection layer might extend the pipeline to domains with different temporal structure.
Load-bearing premise
The data stream must satisfy the exchangeability assumption so the conformal risk control bound holds.
What would settle it
Run the pipeline on a new stream where successive observations are temporally dependent and check whether the realized false-positive rate exceeds the bound promised by the chosen alpha.
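One way to run this check is to compute the realized false-positive rate over consecutive blocks of the stream and flag the blocks that exceed alpha. The sketch below is a hypothetical diagnostic, not part of the paper's artifacts: the function names, the label convention (0 for benign, 1 for attack), and the fixed block size are assumptions.

```python
def realized_fpr(scores, labels, threshold):
    """Fraction of benign flows (label 0) whose score exceeds the threshold."""
    benign = [s for s, y in zip(scores, labels) if y == 0]
    if not benign:
        return 0.0  # no benign flows in this slice, so no false positives
    return sum(s > threshold for s in benign) / len(benign)


def windows_over_budget(scores, labels, threshold, alpha, window):
    """Start indices of consecutive windows whose realized FPR exceeds alpha.

    The final window may be shorter than `window` if the stream length
    is not a multiple of it.
    """
    flagged = []
    for start in range(0, len(scores), window):
        end = start + window
        fpr = realized_fpr(scores[start:end], labels[start:end], threshold)
        if fpr > alpha:
            flagged.append(start)
    return flagged


# Invented toy stream: the first window contains one false alert out of two benign flows.
stream_scores = [0.1, 0.9, 0.2, 0.8]
stream_labels = [0, 0, 0, 1]
print(windows_over_budget(stream_scores, stream_labels, 0.5, alpha=0.25, window=2))
```

Because the bound concerns expected risk, a single window over budget is evidence rather than proof of a violation. Many such windows clustered in time would point to dependence or drift.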
Original abstract
Streaming intrusion detection systems must process flows continuously under bounded memory, yet most leave alerting-threshold selection as a post-hoc tuning problem incompatible with production, where operators commit in advance to alert budgets, misclassification costs, and Service Level Objectives. We present CALIBURN, a streaming alerting pipeline that derives its decision threshold from these operational inputs rather than a label-dependent search. CALIBURN composes five layers on one streaming substrate: truncated Bayesian online change-point detection; isotonic calibration of the posterior to a conditional attack probability; cost-sensitive thresholding from operator costs; a Conformal Risk Control (CRC) wrapper mapping an alert budget alpha to a false-positive-bounded threshold under exchangeability; and multi-window burn-rate alerting from Site Reliability Engineering. Each layer is established; the contribution is the integration and a falsifiable finding about it: the behaviour of calibration and conformal risk control is strongly regime-dependent across attack prevalence. Across three regimes -- LITNET-2020 (5.2%), CICIDS2017 (22%), UNSW-NB15 (64%) -- CALIBURN reaches AUC-PR 0.943 in the rare-attack regime it targets, beating the best streaming baseline by 2.21x and the best batch reference by 4.12x, with isotonic calibration cutting Brier score 30%; it stays strongest among streaming methods at moderate prevalence; and all converge to the prevalence floor under base-rate inversion. A TTL-feature ablation shows this high-prevalence collapse is intrinsic to streaming, not a dataset artifact. We further identify two mechanisms -- a theoretical CRC overshoot 2B/(n0+1) and an empirical-density degeneracy -- collapsing conformal alerting at very small alpha, and propose both as pre-deployment checks. Code and artifacts: Apache 2.0, Zenodo DOI 10.5281/zenodo.20074590.
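The cost-sensitive thresholding layer can be illustrated with the textbook Bayes decision rule. Alerting on a flow with calibrated attack probability p has expected cost (1 - p) * c_FP, and staying silent has expected cost p * c_FN, so alerting is cheaper once p >= c_FP / (c_FP + c_FN). The sketch below assumes correct decisions cost nothing; the abstract does not give the paper's exact rule, and the function names are hypothetical.

```python
def cost_sensitive_threshold(cost_false_positive, cost_false_negative):
    """Probability above which alerting has lower expected cost than silence."""
    if cost_false_positive < 0 or cost_false_negative < 0:
        raise ValueError("costs must be non-negative")
    total = cost_false_positive + cost_false_negative
    if total == 0:
        raise ValueError("at least one cost must be positive")
    return cost_false_positive / total


def should_alert(attack_probability, cost_false_positive, cost_false_negative):
    """Alert when the calibrated probability reaches the cost-derived threshold."""
    threshold = cost_sensitive_threshold(cost_false_positive, cost_false_negative)
    return attack_probability >= threshold


# Invented costs: a missed attack is nine times as costly as a false alert.
print(cost_sensitive_threshold(1, 9))
print(should_alert(0.2, 1, 9), should_alert(0.05, 1, 9))
```

The rule is only meaningful when its input is a calibrated probability, which is why a calibration layer would sit in front of it.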
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CALIBURN, a five-layer streaming IDS pipeline (truncated Bayesian online change-point detection, isotonic calibration to conditional attack probability, cost-sensitive thresholding, CRC wrapper for alert-budget alpha under exchangeability, and multi-window burn-rate alerting) that derives thresholds from operational inputs rather than post-hoc search. It reports regime-dependent empirical behavior across three datasets with attack prevalences 5.2% (LITNET-2020), 22% (CICIDS2017), and 64% (UNSW-NB15), achieving AUC-PR 0.943 in the low-prevalence regime (2.21x over best streaming baseline, 4.12x over best batch reference, 30% Brier reduction from isotonic calibration), with all methods converging at high prevalence; it also identifies CRC overshoot 2B/(n0+1) and density degeneracy as failure modes at small alpha. Code and artifacts are released.
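For readers unfamiliar with the last layer, the sketch below shows the general shape of multi-window burn-rate alerting as used in Site Reliability Engineering: the burn rate is the observed bad-event rate divided by the error budget (1 minus the SLO target), and a page fires only when both a long and a short window burn faster than a chosen factor. How CALIBURN defines a bad event and which windows and factors it uses are not stated here, so the function names, the factor of 14.4, and the counts are assumptions.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed bad-event rate divided by the error budget (1 - slo_target)."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_page(long_window, short_window, slo_target, factor=14.4):
    """Page only if both windows burn budget at least `factor` times too fast.

    Each window is a (bad_events, total_events) pair. The short window
    keeps the alert from firing after the problem has already stopped.
    """
    return (burn_rate(*long_window, slo_target) >= factor
            and burn_rate(*short_window, slo_target) >= factor)


# Invented counts against a 99.9% SLO target.
print(should_page((20, 1000), (3, 100), slo_target=0.999))  # both windows burning
print(should_page((20, 1000), (0, 100), slo_target=0.999))  # short window is clean
```

Requiring both windows trades a little detection delay for fewer pages on incidents that have already ended.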
Significance. If the empirical results and CRC guarantees hold under the stated conditions, the work offers a practical, operator-driven alternative to label-dependent threshold tuning in streaming IDS, with the regime-dependent finding and pre-deployment checks providing falsifiable guidance. The public code release and Zenodo artifacts strengthen reproducibility.
major comments (1)
- [CRC wrapper and exchangeability assumption] CRC wrapper description (abstract and methods): the false-positive bound for alert budget alpha is stated to hold 'under exchangeability' for the streaming substrate, yet network flows exhibit autocorrelation, concept drift, and non-stationarity that typically violate exchangeability. No blocking, time-series conformal adjustments, or dependence-robust bounds are described, which directly undermines the central claim that the pipeline 'derives its decision threshold from these operational inputs' via CRC.
minor comments (2)
- [Ablation study] The TTL-feature ablation is cited to show high-prevalence collapse is intrinsic to streaming, but the precise definition of the TTL features and the windowing used in the ablation are not detailed enough to allow independent verification of the claim.
- [CRC failure modes] The theoretical CRC overshoot formula 2B/(n0+1) is presented as a pre-deployment check, but the derivation of B and n0 in the streaming context is not expanded.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the exchangeability assumption. We agree that this is a substantive limitation and will revise the manuscript to clarify the scope of the CRC guarantee and its implications for the central claim.
Point-by-point responses
- Referee: [CRC wrapper and exchangeability assumption] CRC wrapper description (abstract and methods): the false-positive bound for alert budget alpha is stated to hold 'under exchangeability' for the streaming substrate, yet network flows exhibit autocorrelation, concept drift, and non-stationarity that typically violate exchangeability. No blocking, time-series conformal adjustments, or dependence-robust bounds are described, which directly undermines the central claim that the pipeline 'derives its decision threshold from these operational inputs' via CRC.
  Authors: We thank the referee for this observation. The manuscript states the CRC bound holds 'under exchangeability' (Section 4.4 and abstract). We acknowledge that streaming network flows violate exchangeability through autocorrelation, concept drift, and non-stationarity, and that the paper introduces no blocking, time-series conformal adjustments, or dependence-robust bounds. This is a genuine limitation: the CRC layer supplies an operational threshold only when the assumption holds approximately, and the central claim is therefore qualified rather than unconditional. We will revise the manuscript to (i) state the limitation more explicitly in the abstract, methods, and discussion, (ii) add the exchangeability violation to the list of pre-deployment checks alongside CRC overshoot and density degeneracy, and (iii) note potential future mitigations such as blocking or dependence-robust conformal methods. No new technical development is claimed in the current work.
  Revision: yes
Circularity Check
No circularity; empirical integration of established layers on public data
The paper composes five established layers (Bayesian change-point detection, isotonic calibration, cost-sensitive thresholding, CRC under exchangeability, burn-rate alerting) and reports regime-dependent empirical behavior on three public datasets (LITNET-2020, CICIDS2017, UNSW-NB15) with released code. The central claim is an observation about AUC-PR, Brier score, and prevalence effects, not a derivation that reduces by the paper's equations to quantities defined only in terms of its own fitted parameters or self-citations. The exchangeability assumption for CRC is stated as an external precondition rather than derived internally. No load-bearing step matches any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
free parameters (2)
- alert budget alpha
- misclassification costs
axioms (1)
- Domain assumption: exchangeability of the data stream for the Conformal Risk Control wrapper