CALIBURN: Operationally Calibrated Streaming Intrusion Detection with Regime-Dependent Conformal Risk Control

Michel A. Youssef

arxiv: 2605.24696 · v2 · pith:US7MYAA3new · submitted 2026-05-23 · 💻 cs.CR · cs.LG

Pith reviewed 2026-06-30 12:47 UTC · model grok-4.3

keywords: streaming intrusion detection, conformal risk control, isotonic calibration, change-point detection, alert thresholding, regime dependence, false-positive control, burn-rate alerting

The pith

Streaming intrusion detection can set its alerting threshold directly from operator budgets and costs via conformal risk control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CALIBURN, a streaming pipeline that combines Bayesian online change-point detection, isotonic calibration to conditional attack probability, cost-sensitive thresholding, a conformal risk control wrapper, and multi-window burn-rate alerting. It shows that the behavior of calibration and conformal risk control varies strongly with attack prevalence across three datasets. The integration allows thresholds to come from operational parameters instead of label-dependent search after training. In the low-prevalence regime the method reaches an AUC-PR of 0.943 and substantially outperforms both streaming and batch baselines.
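The calibration layer and the Brier score it is judged by can be illustrated with a minimal sketch. It uses the pool-adjacent-violators algorithm, the standard way to fit a monotone map from raw scores to probabilities. The function names and the toy data are assumptions for illustration and are not taken from the CALIBURN code.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def fit_isotonic(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Pool-adjacent-violators fit.

    Returns (upper_score, calibrated_probability) blocks sorted by score.
    """
    blocks: List[List[float]] = []  # each block: [label_sum, count, max_score]
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            # Tied scores must share one calibrated value.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1.0, s])
        # Merge backwards while block means are not strictly increasing.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, top = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, label_sum / count) for label_sum, count, top in blocks]


def predict_isotonic(model: List[Tuple[float, float]], score: float) -> float:
    """Look up the calibrated probability of the block containing `score`."""
    for top, prob in model:
        if score <= top:
            return prob
    return model[-1][1]


# Toy data (hypothetical): raw detector scores and attack labels.
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 0, 1, 1]
model = fit_isotonic(raw, labels)
calibrated = [predict_isotonic(model, s) for s in raw]
print(brier_score(raw, labels), brier_score(calibrated, labels))
```

The sketch fits and scores on the same six points, so the drop in Brier score is guaranteed here; on held-out flows the size of the improvement is an empirical question, which is what the paper's 30% figure addresses.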

Core claim

The behaviour of calibration and conformal risk control is strongly regime-dependent across attack prevalence. Across three regimes -- LITNET-2020 (5.2%), CICIDS2017 (22%), UNSW-NB15 (64%) -- CALIBURN reaches AUC-PR 0.943 in the rare-attack regime it targets, beating the best streaming baseline by 2.21x and the best batch reference by 4.12x, with isotonic calibration cutting Brier score 30%; it stays strongest among streaming methods at moderate prevalence; and all converge to the prevalence floor under base-rate inversion.
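The "prevalence floor" can be made concrete with a short sketch: a scorer that carries no information has an area under the precision-recall curve equal to the attack prevalence. The streams below are synthetic 1000-flow label vectors built only to match the three quoted prevalences; they are not the datasets themselves, and `average_precision` is a hand-written helper rather than the paper's evaluation code.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise area under the precision-recall curve.

    Flows with tied scores share one threshold, so a constant scorer
    produces a single operating point.
    """
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    ap, tp, prev_recall, i = 0.0, 0, 0.0, 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            j += 1
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / j)  # precision over the top-j flows
        prev_recall = recall
        i = j
    return ap


def synthetic_labels(n: int, n_attacks: int) -> List[int]:
    """Label vector with the requested number of attacks (hypothetical data)."""
    return [1] * n_attacks + [0] * (n - n_attacks)


for name, n_attacks in [("5.2% regime", 52), ("22% regime", 220), ("64% regime", 640)]:
    labels = synthetic_labels(1000, n_attacks)
    floor = average_precision([0.5] * 1000, labels)  # uninformative constant scorer
    print(name, floor)
```

Read this way, an AUC-PR near 0.64 on a 64%-prevalence stream means a method is doing no better than ignoring the features, whereas 0.943 on a 5.2%-prevalence stream sits far above its floor.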

What carries the argument

The Conformal Risk Control wrapper that maps a pre-specified alert budget alpha to a false-positive-bounded threshold under the exchangeability assumption.
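A minimal sketch of how such a wrapper can turn alpha into a threshold follows. It applies the standard conformal risk control selection rule to a held-out calibration set. The loss used here (1 when a benign flow is alerted, 0 otherwise, with alerts raised when `score > lam`) is an assumption, since this page does not state the paper's exact loss; with this choice the bound is on the expected fraction of flows that are false alerts.

```python
import math
from bisect import bisect_right
from typing import Sequence


def crc_threshold(
    scores: Sequence[float], labels: Sequence[int], alpha: float, B: float = 1.0
) -> float:
    """Smallest threshold lam whose adjusted empirical risk is at most alpha.

    Assumed loss per calibration flow: 1 if the flow is benign and
    score > lam, else 0. It is non-increasing and right-continuous in lam
    and bounded above by B = 1.
    """
    n = len(scores)
    benign = sorted(s for s, y in zip(scores, labels) if y == 0)
    for lam in [-math.inf] + sorted(set(scores)):
        false_alerts = len(benign) - bisect_right(benign, lam)
        # Equals n/(n+1) * (false_alerts/n) + B/(n+1), the CRC-adjusted risk.
        if (false_alerts + B) / (n + 1) <= alpha:
            return lam
    return math.inf  # no threshold meets the budget: never alert


# Hypothetical calibration set: 15 benign flows and 4 attacks.
benign_scores = [k / 100 for k in range(1, 16)]
attack_scores = [0.6, 0.7, 0.8, 0.9]
cal_scores = benign_scores + attack_scores
cal_labels = [0] * len(benign_scores) + [1] * len(attack_scores)

print(crc_threshold(cal_scores, cal_labels, alpha=0.12))
print(crc_threshold(cal_scores, cal_labels, alpha=0.01))
```

With only 19 calibration flows, the second call cannot satisfy a 1% budget at any threshold and falls back to never alerting, which is a small-scale version of the small-alpha collapse the paper reports.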

If this is right

In rare-attack streams the integrated pipeline maintains high detection performance.
At moderate prevalence it remains the strongest among streaming methods.
Under high prevalence, the performance of all methods approaches the base-rate floor.
The high-prevalence collapse is intrinsic to streaming rather than a dataset artifact.
CRC overshoot of 2B/(n0+1) and empirical-density degeneracy limit conformal alerting at very small alpha.
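The overshoot term in the last point lends itself to a one-line pre-deployment check. The sketch below reads B as the upper bound on the loss and n0 as the number of calibration points, which is how the corresponding term appears in the standard conformal risk control analysis; the page itself does not define either symbol, so both readings are assumptions, as are the function names and the `max_ratio` knob.

```python
def crc_slack(B: float, n0: int) -> float:
    """The 2B/(n0+1) term, under the assumed meaning of B and n0."""
    return 2.0 * B / (n0 + 1)


def budget_is_resolvable(alpha: float, B: float, n0: int, max_ratio: float = 1.0) -> bool:
    """True when the slack term is at most max_ratio times the alert budget.

    max_ratio is a hypothetical tolerance, not a parameter from the paper.
    """
    return crc_slack(B, n0) <= max_ratio * alpha


for n0 in (99, 999, 9999):
    print(n0, crc_slack(1.0, n0), budget_is_resolvable(0.001, 1.0, n0))
```

For a 0.1% budget and a 0/1 loss, the term is larger than the budget itself until the calibration set reaches roughly two thousand points, so a check of this kind flags budgets that the calibration data cannot resolve.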

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed regime dependence may appear in other streaming binary classification tasks that use conformal wrappers.
The two proposed pre-deployment checks for CRC failure could be applied to conformal methods outside security.
Adjusting the change-point detection layer might extend the pipeline to domains with different temporal structure.

Load-bearing premise

The data stream must satisfy the exchangeability assumption so the conformal risk control bound holds.

What would settle it

Run the pipeline on a new stream where successive observations are temporally dependent and check whether the realized false-positive rate exceeds the bound promised by the chosen alpha.
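The shape of that test can be sketched in a few lines. The example below calibrates a threshold on benign scores with a plain split-conformal quantile (a stand-in for the paper's CRC layer, not a copy of it) and then measures the realized false-positive rate on two later segments: one drawn like the calibration data and one whose benign scores have drifted upward. The drift is a simple mean shift chosen for determinism; a real check would use a temporally dependent stream. All names and data are hypothetical.

```python
import math
from typing import Sequence


def benign_quantile_threshold(
    scores: Sequence[float], labels: Sequence[int], alpha: float
) -> float:
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest benign score."""
    benign = sorted(s for s, y in zip(scores, labels) if y == 0)
    rank = math.ceil((len(benign) + 1) * (1 - alpha))
    if rank > len(benign):
        return math.inf  # too few calibration points for this alpha
    return benign[rank - 1]


def realized_fpr(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of benign flows whose score exceeds the threshold."""
    benign = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s > threshold for s in benign) / len(benign)


alpha = 0.1
calib = [k / 100 for k in range(1, 100)]            # 99 benign calibration scores
same = [k / 100 for k in range(1, 100)]             # later segment, same distribution
drifted = [(k + 20) / 100 for k in range(1, 100)]   # later segment, scores shifted up
benign_labels = [0] * 99

threshold = benign_quantile_threshold(calib, benign_labels, alpha)
fpr_same = realized_fpr(same, benign_labels, threshold)
fpr_drifted = realized_fpr(drifted, benign_labels, threshold)
print(threshold, fpr_same, fpr_drifted)
```

On the undrifted segment the realized rate stays within the budget; on the drifted segment it is roughly three times alpha, which is the kind of outcome that would count against the operational guarantee.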

Figures

Figures reproduced from arXiv: 2605.24696 by Michel A. Youssef.

Figure 2: Truncated BOCPD posterior dynamics on a synthetic stream with one change ...

Figure 3: Multi-window burn-rate alerting on a synthetic event stream. (a) shows a ...

Figure 4: LITNET-2020 AUC-PR across all evaluated methods. Bars show the 3-seed ...

Figure 5: Regime sensitivity of AUC-PR across the three NIDS datasets, ordered by attack ...

Original abstract

Streaming intrusion detection systems must process flows continuously under bounded memory, yet most leave alerting-threshold selection as a post-hoc tuning problem incompatible with production, where operators commit in advance to alert budgets, misclassification costs, and Service Level Objectives. We present CALIBURN, a streaming alerting pipeline that derives its decision threshold from these operational inputs rather than a label-dependent search. CALIBURN composes five layers on one streaming substrate: truncated Bayesian online change-point detection; isotonic calibration of the posterior to a conditional attack probability; cost-sensitive thresholding from operator costs; a Conformal Risk Control (CRC) wrapper mapping an alert budget alpha to a false-positive-bounded threshold under exchangeability; and multi-window burn-rate alerting from Site Reliability Engineering. Each layer is established; the contribution is the integration and a falsifiable finding about it: the behaviour of calibration and conformal risk control is strongly regime-dependent across attack prevalence. Across three regimes -- LITNET-2020 (5.2%), CICIDS2017 (22%), UNSW-NB15 (64%) -- CALIBURN reaches AUC-PR 0.943 in the rare-attack regime it targets, beating the best streaming baseline by 2.21x and the best batch reference by 4.12x, with isotonic calibration cutting Brier score 30%; it stays strongest among streaming methods at moderate prevalence; and all converge to the prevalence floor under base-rate inversion. A TTL-feature ablation shows this high-prevalence collapse is intrinsic to streaming, not a dataset artifact. We further identify two mechanisms -- a theoretical CRC overshoot 2B/(n0+1) and an empirical-density degeneracy -- collapsing conformal alerting at very small alpha, and propose both as pre-deployment checks. Code and artifacts: Apache 2.0, Zenodo DOI 10.5281/zenodo.20074590.
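Of the five layers, multi-window burn-rate alerting is the one least familiar outside Site Reliability Engineering, so a minimal sketch may help. The burn rate is the observed fraction of bad events in a window divided by the error budget, and a page fires only when both a long and a short window exceed the same burn-rate threshold. The class name, the count-based windows, and the numbers below are assumptions for illustration; the abstract does not say how CALIBURN defines a bad event or sizes its windows.

```python
from collections import deque
from typing import Deque


class BurnRateAlerter:
    """Fires when both the long and the short window burn the budget too fast."""

    def __init__(
        self, error_budget: float, long_window: int, short_window: int, burn_threshold: float
    ) -> None:
        self.error_budget = error_budget      # allowed fraction of bad events
        self.burn_threshold = burn_threshold  # multiple of the budget that triggers a page
        self.long: Deque[int] = deque(maxlen=long_window)
        self.short: Deque[int] = deque(maxlen=short_window)

    def update(self, is_bad: bool) -> bool:
        self.long.append(int(is_bad))
        self.short.append(int(is_bad))
        if len(self.long) < self.long.maxlen:
            return False  # wait until the long window is full
        long_burn = sum(self.long) / len(self.long) / self.error_budget
        short_burn = sum(self.short) / len(self.short) / self.error_budget
        return long_burn >= self.burn_threshold and short_burn >= self.burn_threshold


# Hypothetical stream: 20 good events, a burst of 5 bad ones, then recovery.
alerter = BurnRateAlerter(error_budget=0.1, long_window=20, short_window=5, burn_threshold=2.2)
events = [False] * 20 + [True] * 5 + [False] * 5
fired = [alerter.update(e) for e in events]
print([i for i, f in enumerate(fired) if f])
```

The long window keeps a brief blip from paging anyone, while the short window lets the alert clear soon after the burst ends even though the long window still remembers it.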

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CALIBURN integrates existing layers into a streaming IDS and reports strong regime dependence in calibration and CRC performance, but the exchangeability assumption for the CRC bound looks shaky for real network flows.

Two things stand out. CALIBURN stitches together Bayesian online change-point detection, isotonic calibration, cost-sensitive thresholds, a conformal risk control wrapper, and burn-rate alerting into one streaming pipeline that sets thresholds from operator budgets and costs rather than label search. It also reports that calibration and CRC behavior changes sharply with attack prevalence, with solid numbers in the low-prevalence regime it targets.

The paper does a few things right. Code and artifacts are released under Apache 2.0 with a Zenodo DOI, which makes the empirical claims checkable. The finding that performance across methods converges to the prevalence floor at high attack rates, backed by a TTL-feature ablation showing it is intrinsic to streaming, is a useful observation. They also flag a theoretical CRC overshoot term and an empirical density degeneracy as practical pre-deployment checks.

The exchangeability assumption required for the CRC layer to deliver a false-positive bound tied to alert budget alpha is the clearest soft spot. Network flows show autocorrelation, concept drift, and non-stationarity, so the scores are unlikely to be exchangeable. The abstract invokes the assumption without mentioning time-series conformal adjustments or dependence-robust bounds, which weakens the operational guarantee even if the AUC-PR numbers hold.

The reported metrics (0.943 AUC-PR in the rare-attack case, 2.21x over streaming baselines, 30% Brier reduction) are concrete, but the review rests on the abstract, so details on data splits, hyperparameter choices, and whether any post-hoc tuning occurred would need checking in the full text.

This is for people working on production streaming detection systems where alert budgets and misclassification costs are fixed in advance. A reader who cares about aligning ML outputs with operational constraints would get value from the regime analysis and the proposed checks.

I would send it for peer review. The integration is practical, the regime-dependence claim is falsifiable, and the code release supports verification, even with the assumption issue that would likely need addressing in revision.

Referee Report

1 major / 2 minor

Summary. The paper introduces CALIBURN, a five-layer streaming IDS pipeline (truncated Bayesian online change-point detection, isotonic calibration to conditional attack probability, cost-sensitive thresholding, CRC wrapper for alert-budget alpha under exchangeability, and multi-window burn-rate alerting) that derives thresholds from operational inputs rather than post-hoc search. It reports regime-dependent empirical behavior across three datasets with attack prevalences 5.2% (LITNET-2020), 22% (CICIDS2017), and 64% (UNSW-NB15), achieving AUC-PR 0.943 in the low-prevalence regime (2.21x over best streaming baseline, 4.12x over best batch reference, 30% Brier reduction from isotonic calibration), with all methods converging at high prevalence; it also identifies CRC overshoot 2B/(n0+1) and density degeneracy as failure modes at small alpha. Code and artifacts are released.

Significance. If the empirical results and CRC guarantees hold under the stated conditions, the work offers a practical, operator-driven alternative to label-dependent threshold tuning in streaming IDS, with the regime-dependent finding and pre-deployment checks providing falsifiable guidance. The public code release and Zenodo artifacts strengthen reproducibility.

major comments (1)

[CRC wrapper and exchangeability assumption] CRC wrapper description (abstract and methods): the false-positive bound for alert budget alpha is stated to hold 'under exchangeability' for the streaming substrate, yet network flows exhibit autocorrelation, concept drift, and non-stationarity that typically violate exchangeability. No blocking, time-series conformal adjustments, or dependence-robust bounds are described, which directly undermines the central claim that the pipeline 'derives its decision threshold from these operational inputs' via CRC.

minor comments (2)

[Ablation study] The TTL-feature ablation is cited to show high-prevalence collapse is intrinsic to streaming, but the precise definition of the TTL features and the windowing used in the ablation are not detailed enough to allow independent verification of the claim.
[CRC failure modes] The theoretical CRC overshoot formula 2B/(n0+1) is presented as a pre-deployment check, but the derivation of B and n0 in the streaming context is not expanded.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the exchangeability assumption. We agree that this is a substantive limitation and will revise the manuscript to clarify the scope of the CRC guarantee and its implications for the central claim.

Referee: [CRC wrapper and exchangeability assumption] CRC wrapper description (abstract and methods): the false-positive bound for alert budget alpha is stated to hold 'under exchangeability' for the streaming substrate, yet network flows exhibit autocorrelation, concept drift, and non-stationarity that typically violate exchangeability. No blocking, time-series conformal adjustments, or dependence-robust bounds are described, which directly undermines the central claim that the pipeline 'derives its decision threshold from these operational inputs' via CRC.

Authors: We thank the referee for this observation. The manuscript states the CRC bound holds 'under exchangeability' (Section 4.4 and abstract). We acknowledge that streaming network flows violate exchangeability through autocorrelation, concept drift, and non-stationarity, and that the paper introduces no blocking, time-series conformal adjustments, or dependence-robust bounds. This is a genuine limitation: the CRC layer supplies an operational threshold only when the assumption holds approximately, and the central claim is therefore qualified rather than unconditional. We will revise the manuscript to (i) state the limitation more explicitly in the abstract, methods, and discussion, (ii) add the exchangeability violation to the list of pre-deployment checks alongside CRC overshoot and density degeneracy, and (iii) note potential future mitigations such as blocking or dependence-robust conformal methods. No new technical development is claimed in the current work.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical integration of established layers on public data

The paper composes five established layers (Bayesian change-point detection, isotonic calibration, cost-sensitive thresholding, CRC under exchangeability, burn-rate alerting) and reports regime-dependent empirical behavior on three public datasets (LITNET-2020, CICIDS2017, UNSW-NB15) with released code. The central claim is an observation about AUC-PR, Brier score, and prevalence effects, not a derivation that reduces by the paper's equations to quantities defined only in terms of its own fitted parameters or self-citations. The exchangeability assumption for CRC is stated as an external precondition rather than derived internally. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The system composes established components; the main unproven premise is the exchangeability assumption needed for CRC guarantees. No new entities are postulated and operator inputs (alpha, costs) are treated as external rather than fitted parameters.

free parameters (2)

alert budget alpha
Operator-specified input that directly determines the CRC threshold; not fitted to data inside the paper.
misclassification costs
Operator-defined values used for cost-sensitive thresholding; treated as given inputs.
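How operator costs alone can fix a threshold is worth one small sketch. Given a calibrated attack probability and zero cost for correct decisions, alerting minimizes expected cost exactly when the probability is at least cost_fp / (cost_fp + cost_fn). This is the textbook cost-sensitive rule; whether CALIBURN uses precisely this form is not stated on this page, and the function names are hypothetical.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which alerting has lower expected cost than staying silent."""
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(p_attack: float, alert: bool, cost_fp: float, cost_fn: float) -> float:
    """Expected cost of a decision, assuming correct decisions cost nothing."""
    return (1 - p_attack) * cost_fp if alert else p_attack * cost_fn


def should_alert(p_attack: float, cost_fp: float, cost_fn: float) -> bool:
    return p_attack >= cost_threshold(cost_fp, cost_fn)


# Hypothetical costs: a missed attack is nine times as costly as a false alert.
print(cost_threshold(1.0, 9.0))
print(should_alert(0.15, 1.0, 9.0), should_alert(0.05, 1.0, 9.0))
```

The rule only makes sense when the input is a probability, which is why the calibration layer sits upstream of this one.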

axioms (1)

Domain assumption: exchangeability of the data stream for the Conformal Risk Control wrapper
Invoked when the CRC layer maps the alert budget alpha to a false-positive-bounded threshold; required for the theoretical guarantee, including the 2B/(n0+1) overshoot bound.

pith-pipeline@v0.9.1-grok · 5877 in / 1622 out tokens · 45538 ms · 2026-06-30T12:47:22.056260+00:00 · methodology
