arxiv: 2605.24696 · v2 · pith:US7MYAA3new · submitted 2026-05-23 · 💻 cs.CR · cs.LG

CALIBURN: Operationally Calibrated Streaming Intrusion Detection with Regime-Dependent Conformal Risk Control

Pith reviewed 2026-06-30 12:47 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords streaming intrusion detection · conformal risk control · isotonic calibration · change-point detection · alert thresholding · regime dependence · false-positive control · burn-rate alerting

The pith

Streaming intrusion detection can set its alerting threshold directly from operator budgets and costs via conformal risk control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CALIBURN, a streaming pipeline that combines Bayesian online change-point detection, isotonic calibration to conditional attack probability, cost-sensitive thresholding, a conformal risk control wrapper, and multi-window burn-rate alerting. It shows that the behavior of calibration and conformal risk control varies strongly with attack prevalence across three datasets. The integration allows thresholds to come from operational parameters instead of label-dependent search after training. In the low-prevalence regime the method reaches an AUC-PR of 0.943 and substantially outperforms both streaming and batch baselines.
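The cost-sensitive thresholding layer can be illustrated with the standard Bayes decision rule for a calibrated probability. This is a minimal sketch, not the paper's implementation: the function names and the two-cost formulation (`cost_fp`, `cost_fn`) are assumptions made for illustration, and the paper may use a richer cost model.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimizes expected cost.

    Alerting costs (1 - p) * cost_fp in expectation and staying silent
    costs p * cost_fn, so alerting is cheaper once p >= cost_fp / (cost_fp + cost_fn).
    Assumes p is a calibrated attack probability (hypothetical setup).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def should_alert(p_attack: float, cost_fp: float, cost_fn: float) -> bool:
    """Alert when the calibrated attack probability reaches the cost threshold."""
    return p_attack >= cost_threshold(cost_fp, cost_fn)
```

With a missed attack priced at nine times a false alert, the threshold drops to 0.1, which shows why the rule needs calibrated probabilities rather than raw scores: the threshold is only meaningful if a score of 0.1 really corresponds to a 10% attack probability.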

Core claim

The behaviour of calibration and conformal risk control is strongly regime-dependent across attack prevalence. Across three regimes -- LITNET-2020 (5.2%), CICIDS2017 (22%), UNSW-NB15 (64%) -- CALIBURN reaches AUC-PR 0.943 in the rare-attack regime it targets, beating the best streaming baseline by 2.21x and the best batch reference by 4.12x, with isotonic calibration cutting Brier score 30%; it stays strongest among streaming methods at moderate prevalence; and all converge to the prevalence floor under base-rate inversion.
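To make the calibration claim concrete, here is a minimal sketch of isotonic calibration via the pool-adjacent-violators algorithm, together with the Brier score used to measure its effect. It is a generic textbook version under stated assumptions, not the paper's code: the function names are hypothetical, and it fits in batch rather than on a stream.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    Returns (edges, values): values[i] is the calibrated probability for
    scores up to and including edges[i].
    """
    pooled = {}  # pool tied scores first so each score maps to one value
    for s, y in zip(scores, labels):
        total, count = pooled.get(s, (0.0, 0))
        pooled[s] = (total + y, count + 1)
    blocks = []  # each block: [label_sum, count, upper_score_edge]
    for s in sorted(pooled):
        total, count = pooled[s]
        blocks.append([total, count, s])
        # merge backwards while the previous block's mean exceeds the last one
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    edges = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return edges, values


def calibrate(edges, values, score):
    """Map a raw score to the value of the first block whose edge covers it."""
    i = bisect_left(edges, score)
    return values[min(i, len(values) - 1)]


def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)
```

On the data it was fitted to, the isotonic fit cannot have a higher Brier score than the raw scores, because the identity map is itself a monotone function. The 30% reduction the paper reports is a held-out measurement, which this sketch does not reproduce.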

What carries the argument

The Conformal Risk Control wrapper that maps a pre-specified alert budget alpha to a false-positive-bounded threshold under the exchangeability assumption.
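A minimal sketch of how such a wrapper can pick a threshold follows the generic Conformal Risk Control recipe: choose the smallest threshold whose inflated empirical risk stays within alpha. The loss used here (an indicator that a benign flow triggers an alert, so B = 1) and the function name are assumptions for illustration; the paper's exact loss and calibration split are not specified in this summary.

```python
def crc_threshold(cal_scores, cal_labels, alpha, B=1.0):
    """Smallest observed score threshold satisfying the CRC condition.

    Per-example loss (assumed): 1 if the flow is benign (label 0) and its
    score is >= the threshold, else 0. The loss is non-increasing in the
    threshold and bounded by B = 1, as CRC requires.
    CRC condition: (n / (n + 1)) * empirical_risk + B / (n + 1) <= alpha,
    which simplifies to (false_alerts + B) / (n + 1) <= alpha.
    """
    n = len(cal_scores)
    for lam in sorted(set(cal_scores)):
        false_alerts = sum(
            1 for s, y in zip(cal_scores, cal_labels) if y == 0 and s >= lam
        )
        if (false_alerts + B) / (n + 1) <= alpha:
            return lam
    # No finite threshold qualifies: never alerting is the only safe choice.
    return float("inf")
```

In this sketch a budget below B / (n + 1) leaves no finite threshold, so alerting switches off entirely. That is one visible way a very small alpha can defeat a conformal wrapper; whether it coincides with the failure mechanisms the paper identifies is not established here.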

If this is right

  • In rare-attack streams (LITNET-2020, 5.2% prevalence), the integrated pipeline reaches AUC-PR 0.943.
  • At moderate prevalence (CICIDS2017, 22%), it remains the strongest among streaming methods.
  • Under high prevalence (UNSW-NB15, 64%), the performance of all methods approaches the base-rate floor.
  • The high-prevalence collapse is intrinsic to streaming rather than a dataset artifact.
  • CRC overshoot of 2B/(n0+1) and empirical-density degeneracy limit conformal alerting at very small alpha.
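The overshoot term in the last bullet lends itself to a simple pre-deployment check. The sketch below assumes that B is the upper bound on the per-example loss and n0 is the number of calibration examples; this summary does not define either symbol, so both readings are assumptions. The rule of flagging a configuration when the overshoot exceeds a chosen fraction of alpha is likewise a hypothetical policy, not the paper's.

```python
def crc_overshoot(B: float, n0: int) -> float:
    """Overshoot term 2B / (n0 + 1), with B and n0 interpreted as assumed above."""
    if n0 < 0:
        raise ValueError("n0 must be non-negative")
    return 2.0 * B / (n0 + 1)


def overshoot_ok(alpha: float, B: float, n0: int, max_fraction: float = 0.5) -> bool:
    """True when the overshoot is at most max_fraction of the alert budget.

    max_fraction is a hypothetical tolerance chosen by the operator.
    """
    return crc_overshoot(B, n0) <= max_fraction * alpha
```

With B = 1 and n0 = 999 the term is 0.002, which is negligible against alpha = 0.05 but four times the tolerance for alpha = 0.001. Under these assumptions, a small budget needs a proportionally larger calibration set before the check passes.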

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed regime dependence may appear in other streaming binary classification tasks that use conformal wrappers.
  • The two proposed pre-deployment checks for CRC failure could be applied to conformal methods outside security.
  • Adjusting the change-point detection layer might extend the pipeline to domains with different temporal structure.

Load-bearing premise

The data stream must satisfy the exchangeability assumption so the conformal risk control bound holds.

What would settle it

Run the pipeline on a new stream where successive observations are temporally dependent and check whether the realized false-positive rate exceeds the bound promised by the chosen alpha.
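A minimal sketch of that check, assuming the stream has already been scored and labelled: it splits the stream into consecutive windows and reports those whose realized false-positive rate exceeds alpha. The function name, the non-overlapping windowing, and the definition of the rate as false alerts divided by benign events are illustrative assumptions.

```python
def windows_exceeding_budget(alerts, labels, alpha, window):
    """Return (start_index, false_positive_rate) for windows that exceed alpha.

    alerts: sequence of 0/1 alert decisions in stream order.
    labels: sequence of 0/1 ground-truth labels (1 = attack).
    Windows are consecutive and non-overlapping; a trailing partial window
    is ignored, as are windows with no benign events.
    """
    violations = []
    for start in range(0, len(alerts) - window + 1, window):
        false_pos = 0
        benign = 0
        for a, y in zip(alerts[start:start + window], labels[start:start + window]):
            if y == 0:
                benign += 1
                if a:
                    false_pos += 1
        if benign and false_pos / benign > alpha:
            violations.append((start, false_pos / benign))
    return violations
```

Because the conformal guarantee holds in expectation, an isolated window above alpha is ordinary sampling noise. Sustained or clustered exceedances are the pattern that would point to a violated exchangeability assumption.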

Figures

Figures reproduced from arXiv: 2605.24696 by Michel A. Youssef.

Figure 1: CALIBURN architecture, organized into three responsibility layers. Streaming ...
Figure 2: Truncated BOCPD posterior dynamics on a synthetic stream with one change ...
Figure 3: Multi-window burn-rate alerting on a synthetic event stream. (a) shows a ...
Figure 4: LITNET-2020 AUC-PR across all evaluated methods. Bars show the 3-seed ...
Figure 5: Regime sensitivity of AUC-PR across the three NIDS datasets, ordered by attack ...
Original abstract

Streaming intrusion detection systems must process flows continuously under bounded memory, yet most leave alerting-threshold selection as a post-hoc tuning problem incompatible with production, where operators commit in advance to alert budgets, misclassification costs, and Service Level Objectives. We present CALIBURN, a streaming alerting pipeline that derives its decision threshold from these operational inputs rather than a label-dependent search. CALIBURN composes five layers on one streaming substrate: truncated Bayesian online change-point detection; isotonic calibration of the posterior to a conditional attack probability; cost-sensitive thresholding from operator costs; a Conformal Risk Control (CRC) wrapper mapping an alert budget alpha to a false-positive-bounded threshold under exchangeability; and multi-window burn-rate alerting from Site Reliability Engineering. Each layer is established; the contribution is the integration and a falsifiable finding about it: the behaviour of calibration and conformal risk control is strongly regime-dependent across attack prevalence. Across three regimes -- LITNET-2020 (5.2%), CICIDS2017 (22%), UNSW-NB15 (64%) -- CALIBURN reaches AUC-PR 0.943 in the rare-attack regime it targets, beating the best streaming baseline by 2.21x and the best batch reference by 4.12x, with isotonic calibration cutting Brier score 30%; it stays strongest among streaming methods at moderate prevalence; and all converge to the prevalence floor under base-rate inversion. A TTL-feature ablation shows this high-prevalence collapse is intrinsic to streaming, not a dataset artifact. We further identify two mechanisms -- a theoretical CRC overshoot 2B/(n0+1) and an empirical-density degeneracy -- collapsing conformal alerting at very small alpha, and propose both as pre-deployment checks. Code and artifacts: Apache 2.0, Zenodo DOI 10.5281/zenodo.20074590.
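The multi-window burn-rate layer borrows a pattern from Site Reliability Engineering: an alert fires only when both a long and a short window are consuming the budget faster than a chosen factor. The sketch below is a generic illustration of that pattern. The function names, the count-based windows, and the idea of treating each budget-consuming event as a 0/1 flag are assumptions, not details taken from the paper.

```python
def burn_rate(events, budget_rate: float) -> float:
    """Observed event rate divided by the budgeted rate.

    events: sequence of 0/1 flags, 1 meaning the event consumed budget.
    A burn rate of 1.0 means the budget is being spent exactly on schedule.
    """
    events = list(events)
    if not events:
        return 0.0
    return (sum(events) / len(events)) / budget_rate


def multiwindow_alert(events, budget_rate, long_n, short_n, factor) -> bool:
    """Fire only if both the long and the short trailing windows burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening, which lets the alert reset quickly once it stops.
    """
    long_burn = burn_rate(events[-long_n:], budget_rate)
    short_burn = burn_rate(events[-short_n:], budget_rate)
    return long_burn >= factor and short_burn >= factor
```

A burst at the start of the long window that has since stopped raises the long-window burn rate but not the short one, so no alert fires. Requiring both windows is what keeps the alert from lingering after an incident ends.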

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces CALIBURN, a five-layer streaming IDS pipeline (truncated Bayesian online change-point detection, isotonic calibration to conditional attack probability, cost-sensitive thresholding, CRC wrapper for alert-budget alpha under exchangeability, and multi-window burn-rate alerting) that derives thresholds from operational inputs rather than post-hoc search. It reports regime-dependent empirical behavior across three datasets with attack prevalences 5.2% (LITNET-2020), 22% (CICIDS2017), and 64% (UNSW-NB15), achieving AUC-PR 0.943 in the low-prevalence regime (2.21x over best streaming baseline, 4.12x over best batch reference, 30% Brier reduction from isotonic calibration), with all methods converging at high prevalence; it also identifies CRC overshoot 2B/(n0+1) and density degeneracy as failure modes at small alpha. Code and artifacts are released.

Significance. If the empirical results and CRC guarantees hold under the stated conditions, the work offers a practical, operator-driven alternative to label-dependent threshold tuning in streaming IDS, with the regime-dependent finding and pre-deployment checks providing falsifiable guidance. The public code release and Zenodo artifacts strengthen reproducibility.

major comments (1)
  1. [CRC wrapper and exchangeability assumption] CRC wrapper description (abstract and methods): the false-positive bound for alert budget alpha is stated to hold 'under exchangeability' for the streaming substrate, yet network flows exhibit autocorrelation, concept drift, and non-stationarity that typically violate exchangeability. No blocking, time-series conformal adjustments, or dependence-robust bounds are described, which directly undermines the central claim that the pipeline 'derives its decision threshold from these operational inputs' via CRC.
minor comments (2)
  1. [Ablation study] The TTL-feature ablation is cited to show high-prevalence collapse is intrinsic to streaming, but the precise definition of the TTL features and the windowing used in the ablation are not detailed enough to allow independent verification of the claim.
  2. [CRC failure modes] The theoretical CRC overshoot formula 2B/(n0+1) is presented as a pre-deployment check, but the derivation of B and n0 in the streaming context is not expanded.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the exchangeability assumption. We agree that this is a substantive limitation and will revise the manuscript to clarify the scope of the CRC guarantee and its implications for the central claim.

Point-by-point responses
  1. Referee: [CRC wrapper and exchangeability assumption] CRC wrapper description (abstract and methods): the false-positive bound for alert budget alpha is stated to hold 'under exchangeability' for the streaming substrate, yet network flows exhibit autocorrelation, concept drift, and non-stationarity that typically violate exchangeability. No blocking, time-series conformal adjustments, or dependence-robust bounds are described, which directly undermines the central claim that the pipeline 'derives its decision threshold from these operational inputs' via CRC.

    Authors: We thank the referee for this observation. The manuscript states the CRC bound holds 'under exchangeability' (Section 4.4 and abstract). We acknowledge that streaming network flows violate exchangeability through autocorrelation, concept drift, and non-stationarity, and that the paper introduces no blocking, time-series conformal adjustments, or dependence-robust bounds. This is a genuine limitation: the CRC layer supplies an operational threshold only when the assumption holds approximately, and the central claim is therefore qualified rather than unconditional. We will revise the manuscript to (i) state the limitation more explicitly in the abstract, methods, and discussion, (ii) add the exchangeability violation to the list of pre-deployment checks alongside CRC overshoot and density degeneracy, and (iii) note potential future mitigations such as blocking or dependence-robust conformal methods. No new technical development is claimed in the current work. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical integration of established layers on public data

full rationale

The paper composes five established layers (Bayesian change-point detection, isotonic calibration, cost-sensitive thresholding, CRC under exchangeability, burn-rate alerting) and reports regime-dependent empirical behavior on three public datasets (LITNET-2020, CICIDS2017, UNSW-NB15) with released code. The central claim is an observation about AUC-PR, Brier score, and prevalence effects, not a derivation that reduces by the paper's equations to quantities defined only in terms of its own fitted parameters or self-citations. The exchangeability assumption for CRC is stated as an external precondition rather than derived internally. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The system composes established components; the main unproven premise is the exchangeability assumption needed for CRC guarantees. No new entities are postulated and operator inputs (alpha, costs) are treated as external rather than fitted parameters.

free parameters (2)
  • alert budget alpha
    Operator-specified input that directly determines the CRC threshold; not fitted to data inside the paper.
  • misclassification costs
    Operator-defined values used for cost-sensitive thresholding; treated as given inputs.
axioms (1)
  • domain assumption Exchangeability of the data stream for the Conformal Risk Control wrapper
    Invoked when the CRC layer maps the alert budget alpha to a false-positive-bounded threshold; the theoretical guarantee, including the 2B/(n0+1) overshoot term, depends on it.


