Improving Outbreak Detection with Stacking of Statistical Surveillance Methods

Eneldo Loza Mencía; Johannes Fürnkranz; Moritz Kulessa

arxiv: 1907.07464 · v1 · pith:SBX6V3YCnew · submitted 2019-07-17 · 💻 cs.LG · q-bio.QM · stat.ML

Pith reviewed 2026-05-24 20:25 UTC · model grok-4.3

keywords: outbreak detection · statistical surveillance · machine learning fusion · p-values · stacking classifier · false alarm rate · synthetic data

The pith

Using p-values from statistical outbreak detectors for machine learning fusion improves detection over binary alarms on synthetic data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a machine learning classifier can combine multiple statistical surveillance algorithms to detect disease outbreaks more effectively than any single algorithm. It replaces the usual binary alarm signals with the underlying p-values as input features, adds further information, and changes how epidemic periods are labeled during training. A new evaluation measure is introduced that focuses more precisely on keeping false alarms low. Experiments on synthetic streams show that fusion based only on binary outputs often performs worse than the base methods, while the p-value approach identifies stronger patterns and yields better results under strict false-alarm limits.

Core claim

A fusion classifier trained only on the binary outputs of statistical surveillance methods can lower overall performance below that of the individual algorithms, yet replacing those binary signals with the algorithms' p-values together with additional features and adjusted epidemic labeling enables the model to find more valuable detection patterns and improves the trade-off between outbreak detection and false alarms.

What carries the argument

A stacking classifier trained on p-values and supplementary features from multiple statistical outbreak detection algorithms.
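As a rough illustration of the idea, the following sketch builds two kinds of meta-features from the same base detectors: their p-values and their thresholded binary alarms. Everything concrete here is an assumption made for the example, not the paper's setup: the two detectors are simple Poisson tail tests against 7-day and 28-day moving averages, the meta-learner is a tiny hand-written logistic regression, and the simulated stream (baseline mean 5, six extra expected cases during 5-day outbreaks) is invented.

```python
import math
import random

def poisson_sf(k, mu):
    """Upper-tail p-value P(X >= k) for X ~ Poisson(mu)."""
    term, cdf = math.exp(-mu), 0.0
    for i in range(k):
        cdf += term
        term *= mu / (i + 1)
    return min(1.0, max(0.0, 1.0 - cdf))

def sample_poisson(rng, mu):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def p_value_features(counts, t, windows=(7, 28)):
    """One p-value per stand-in detector (hypothetical moving-average baselines)."""
    feats = []
    for w in windows:
        baseline = max(sum(counts[t - w:t]) / w, 0.1)  # guard against a zero mean
        feats.append(poisson_sf(counts[t], baseline))
    return feats

def predict(weights, x):
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def train_logistic(X, y, epochs=200, lr=0.5):
    """Batch gradient descent on the logistic loss; weights[0] is the bias."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            err = predict(weights, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights

# Invented stream: one 5-day outbreak every 60 days.
rng = random.Random(0)
outbreak_days = {s + d for s in range(60, 400, 60) for d in range(5)}
counts = [sample_poisson(rng, 5.0 + (6.0 if t in outbreak_days else 0.0))
          for t in range(400)]

days = range(28, 400)
y = [1 if t in outbreak_days else 0 for t in days]
X_pval = [p_value_features(counts, t) for t in days]
# Binary alarms discard everything except which side of 0.05 each p-value is on.
X_binary = [[1.0 if p < 0.05 else 0.0 for p in row] for row in X_pval]

w_pval = train_logistic(X_pval, y)
w_binary = train_logistic(X_binary, y)
print("weights on p-values:     ", w_pval)
print("weights on binary alarms:", w_binary)
```

The binary feature matrix is a lossy function of the p-value matrix, which is the intuition behind preferring p-values as inputs. Whether that extra information translates into better detection is an empirical question that this toy example does not answer.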

If this is right

Binary-output fusion can reduce performance below that of the underlying statistical methods.
P-value inputs let the fusion model exploit more informative patterns than binary decisions alone.
Additional features and adapted epidemic labeling can further raise fusion performance.
The new performance measure supplies a finer assessment of methods when false-alarm rates must remain very low.
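To make the low-false-alarm focus concrete, here is a minimal sketch of a partial AUC restricted to false-alarm rates of at most 1%. The caption of Figure 3 names two measures, dAUC1% and pAUC1%, but their exact definitions are not reproduced on this page, so the code below is the textbook partial area under the ROC curve, normalized by the width of the interval. It should not be read as the paper's new measure. Scores are assumed to be oriented so that higher means more suspicious (for a p-value `p`, one could use `1 - p`).

```python
def roc_points(scores, labels):
    """ROC points (false-alarm rate, true-positive rate), strictest threshold first."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def partial_auc(scores, labels, max_fpr=0.01):
    """Area under the ROC curve for rates in [0, max_fpr], divided by max_fpr."""
    area = 0.0
    pts = roc_points(scores, labels)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:  # clip the final segment by linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr

# A perfect ranking scores 1.0; an uninformative one (all scores tied) scores max_fpr / 2.
print(partial_auc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
print(partial_auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))
```

With the normalization used here, a method that looks strong on the full ROC curve can still score near zero if its detections only appear once the false-alarm rate exceeds the cap.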

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the p-value advantage persists on real streams, surveillance systems could layer lightweight machine learning on top of existing statistical tools without replacing them.
The same p-value stacking idea could be tested in other anomaly-detection settings that already run multiple detectors in parallel.
Experiments that vary the number and diversity of base algorithms would show whether the observed gain scales or saturates.

Load-bearing premise

The synthetic data generation process and labeling rules used for training and evaluation accurately reflect the statistical properties and labeling challenges of real-world disease surveillance streams.

What would settle it

Applying the p-value fusion classifier to a large set of real disease incidence time series and measuring no gain in detection rate at a fixed low false-alarm level compared with the strongest single statistical method would disprove the reported improvement.
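The comparison described above can be sketched as follows. The sketch assumes day-level scores where higher means more suspicious, a false-alarm rate counted over non-outbreak days, and an outbreak counted as detected if any of its days raises an alarm. These conventions, the function names, and the toy scores are all illustrative choices rather than the paper's protocol.

```python
def threshold_at_far(scores, labels, max_far):
    """Threshold that lets at most a share max_far of non-outbreak days alarm.

    An alarm is raised when score > threshold, so the returned value is the
    (allowed + 1)-th largest score among non-outbreak days.
    """
    negatives = sorted((s for s, l in zip(scores, labels) if l == 0), reverse=True)
    allowed = int(max_far * len(negatives))  # tolerated number of false alarms
    return negatives[allowed]

def outbreak_detection_rate(scores, outbreaks, threshold):
    """Share of outbreaks with at least one alarm day; (start, end) with end exclusive."""
    hits = sum(any(scores[t] > threshold for t in range(s, e)) for s, e in outbreaks)
    return hits / len(outbreaks)

def compare(scores_by_method, labels, outbreaks, max_far):
    """Detection rate of each method at its own threshold for the same false-alarm cap."""
    result = {}
    for name, scores in scores_by_method.items():
        thr = threshold_at_far(scores, labels, max_far)
        result[name] = outbreak_detection_rate(scores, outbreaks, thr)
    return result

def make_scores(overrides, n=20, default=0.1):
    """Toy score series: a flat default with a few hand-picked exceptions."""
    return [overrides.get(t, default) for t in range(n)]

outbreaks = [(5, 8), (14, 16)]
labels = [1 if any(s <= t < e for s, e in outbreaks) else 0 for t in range(20)]
scores_by_method = {
    "single": make_scores({2: 0.9, 4: 0.3, 5: 0.2, 6: 0.5, 7: 0.4, 14: 0.25, 15: 0.3}),
    "fusion": make_scores({1: 0.6, 3: 0.2, 5: 0.7, 15: 0.3}),
}
print(compare(scores_by_method, labels, outbreaks, max_far=0.1))
```

Each method gets its own threshold because raw scores from different methods are not on a common scale; only the false-alarm cap is shared.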

Figures

Figures reproduced from arXiv: 1907.07464 by Eneldo Loza Mencía, Johannes Fürnkranz, Moritz Kulessa.

**Figure 1.** Example for the creation of training data for the learning algorithm including the statistical algorithms Bayes and …

**Figure 2.** ROC curve using the detection rate on the …

**Figure 3.** Results for the measures dAUC1% and pAUC1%. Each box plot represents the distribution of measure values for a particular method computed over all 42 test cases for a fixed outbreak size defined by the parameter k (a bigger value for k indicates more cases per outbreak). In general, we can observe that by narrowing the labeling of the outbreak on particular events (i.e., O1, O2 or O3) a better performance c…

Original abstract

Epidemiologists use a variety of statistical algorithms for the early detection of outbreaks. The practical usefulness of such methods highly depends on the trade-off between the detection rate of outbreaks and the chances of raising a false alarm. Recent research has shown that the use of machine learning for the fusion of multiple statistical algorithms improves outbreak detection. Instead of relying only on the binary output (alarm or no alarm) of the statistical algorithms, we propose to make use of their p-values for training a fusion classifier. In addition, we also show that adding additional features and adapting the labeling of an epidemic period may further improve performance. For comparison and evaluation, a new measure is introduced which captures the performance of an outbreak detection method with respect to a low rate of false alarms more precisely than previous works. Our results on synthetic data show that it is challenging to improve the performance with a trainable fusion method based on machine learning. In particular, the use of a fusion classifier that is only based on binary outputs of the statistical surveillance methods can make the overall performance worse than directly using the underlying algorithms. However, the use of p-values and additional information for the learning is promising, enabling to identify more valuable patterns to detect outbreaks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Binary stacking can hurt outbreak detection on synthetic data while p-value inputs look better, but the gains rest on unvalidated synthetic streams.

The main takeaway is that a fusion classifier using binary alarms from standard surveillance methods can end up worse than the base algorithms alone, while switching to p-values as inputs appears more promising for catching useful patterns on the synthetic streams they tested. They also add extra features, adjust epidemic period labeling, and introduce a new metric focused on low false-alarm performance.

The paper applies stacking in this specific setting and reports that the binary version degrades results while the p-value version does not. That negative result on binary fusion is the clearest practical takeaway, and using p-values is a logical step since they preserve more information than a hard alarm. The new measure is a reasonable attempt to evaluate methods under the constraint that false alarms must stay rare.

The soft spot is the complete dependence on synthetic data. The generation process and labeling rules may not capture real reporting delays, baseline shifts, or fuzzy outbreak onsets, so the observed advantage for p-values could disappear outside their model. No real surveillance streams are mentioned, and the abstract gives no quantitative effect sizes or variability measures.

This is for people already working on statistical surveillance or applying ensembles to sequential anomaly detection. A reader in that area could pick up the p-value idea or the caution on binary fusion. It deserves peer review because the core application is clean and the binary-fusion warning is worth checking, even if the synthetic limitation will need addressing.

Referee Report

2 major / 2 minor

Summary. The paper proposes stacking statistical surveillance methods for early outbreak detection by training a machine-learning fusion classifier on p-values (instead of binary alarms) plus additional features, with an adapted labeling scheme for epidemic periods. It introduces a new performance measure emphasizing low false-alarm regimes and reports, on synthetic data, that binary-alarm fusion can degrade performance while p-value fusion appears promising for identifying more valuable outbreak patterns.

Significance. If the synthetic-data results generalize, the work would demonstrate that p-value inputs allow a trainable fusion method to improve the detection/false-alarm trade-off beyond individual statistical algorithms, which is practically relevant for epidemiological surveillance. The new low-false-alarm measure is a useful addition to the evaluation toolkit. The findings remain conditional on the fidelity of the authors' synthetic generation and labeling process to real surveillance streams.

major comments (2)

[§4] Experimental Evaluation: All reported results, including the degradation under binary fusion and the advantage of p-value fusion, rest exclusively on synthetically generated streams and the authors' labeling rule. No real surveillance data, no sensitivity analysis to generation parameters (temporal dependence, reporting delays, ambiguous onset), and no quantitative metrics with error bars are provided, making the central empirical claim only moderately supported and vulnerable to the modeling assumptions highlighted in the stress-test note.
[Abstract, §3.2] The new performance measure is introduced to capture low false-alarm behavior more precisely, yet the manuscript does not demonstrate that this measure changes the ranking of methods relative to standard metrics or provide its exact functional form and calibration details, which is load-bearing for the claim that p-value stacking is 'promising' under the targeted regime.

minor comments (2)

[§3] Notation for the fusion classifier inputs and the epidemic-period labeling rule should be defined more explicitly with a small example or pseudocode to improve reproducibility.
[§3.1] The manuscript would benefit from a table summarizing the statistical base methods, their p-value definitions, and the exact additional features used in the fusion classifier.
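A small example of the kind the first minor comment asks for might look like the sketch below. It is not taken from the manuscript. The feature names (`count`, `recent_mean`), the `first_n_days` parameter, and the toy numbers are hypothetical, and the labeling variants O1, O2 and O3 mentioned in the Figure 3 caption are not defined on this page, so restricting the positive label to the first days of an outbreak is only one plausible reading of "narrowing the labeling".

```python
def label_days(n_days, outbreaks, first_n_days=None):
    """Label each day 1 (outbreak) or 0.

    outbreaks holds (start, end) index pairs with end exclusive. With
    first_n_days set, only the first n days of each outbreak are positive
    (a hypothetical narrowed labeling).
    """
    labels = [0] * n_days
    for start, end in outbreaks:
        stop = end if first_n_days is None else min(end, start + first_n_days)
        for t in range(start, stop):
            labels[t] = 1
    return labels

def build_rows(counts, p_values, labels, window=7):
    """One training row per day: detector p-values plus two extra features."""
    rows = []
    for t in range(window, len(counts)):
        recent = counts[t - window:t]
        features = {f"p_{name}": series[t] for name, series in p_values.items()}
        features["count"] = counts[t]                  # today's case count
        features["recent_mean"] = sum(recent) / window  # trailing-window mean
        rows.append((features, labels[t]))
    return rows

# Toy data: 12 days, one outbreak on days 8 to 10, two hypothetical detectors.
counts = [4, 5, 3, 6, 4, 5, 4, 5, 11, 12, 9, 5]
p_values = {
    "detector_a": [0.6, 0.4, 0.8, 0.3, 0.6, 0.4, 0.6, 0.4, 0.01, 0.01, 0.10, 0.5],
    "detector_b": [0.5, 0.5, 0.7, 0.2, 0.5, 0.5, 0.5, 0.5, 0.02, 0.01, 0.04, 0.6],
}
full = label_days(len(counts), [(8, 11)])
narrow = label_days(len(counts), [(8, 11)], first_n_days=1)

for features, label in build_rows(counts, p_values, narrow):
    print(label, features)
```

Writing the labeling rule as a function with an explicit parameter makes it easy to report exactly which days were treated as positive in each experiment.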

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important aspects of our experimental design and the presentation of the new performance measure. We respond to each major comment below, indicating planned revisions where appropriate.

Referee: [§4] Experimental Evaluation: All reported results, including the degradation under binary fusion and the advantage of p-value fusion, rest exclusively on synthetically generated streams and the authors' labeling rule. No real surveillance data, no sensitivity analysis to generation parameters (temporal dependence, reporting delays, ambiguous onset), and no quantitative metrics with error bars are provided, making the central empirical claim only moderately supported and vulnerable to the modeling assumptions highlighted in the stress-test note.

Authors: We agree that the central claims rest on synthetic data and that additional analyses would strengthen support. In the revision we will add sensitivity analysis varying temporal dependence, reporting delays, and onset ambiguity, and we will report all metrics with error bars computed over multiple independent simulation runs. We will also expand the discussion of modeling assumptions. Real surveillance streams with verified outbreak labels remain difficult to obtain at scale; we will therefore add an explicit limitations paragraph on this point rather than attempting to include such data.
Revision: partial

Referee: [Abstract, §3.2] The new performance measure is introduced to capture low false-alarm behavior more precisely, yet the manuscript does not demonstrate that this measure changes the ranking of methods relative to standard metrics or provide its exact functional form and calibration details, which is load-bearing for the claim that p-value stacking is 'promising' under the targeted regime.

Authors: We will revise §3.2 to state the exact functional form of the measure together with its calibration procedure. We will also add a short comparative table showing method rankings under the new measure versus standard AUC and F1 at low false-alarm operating points, thereby demonstrating that the measure alters evaluation conclusions in the regime of interest.
Revision: yes

standing simulated objections not resolved

Obtaining large-scale real surveillance data accompanied by reliable, independently verified outbreak labels is not feasible for this study.

Circularity Check

0 steps flagged

No circularity: empirical stacking evaluated on held-out synthetic data

The paper presents an empirical machine-learning approach that trains fusion classifiers on outputs (binary alarms or p-values) from existing statistical surveillance methods, optionally augments with extra features, and evaluates on held-out synthetic streams using a newly defined low-false-alarm performance measure. No derivation reduces a claimed result to its own fitted parameters or to a self-citation chain; the central claims rest on direct experimental comparison rather than any self-definitional or uniqueness-theorem step. The synthetic-data limitation is a validity concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation depends on the realism of synthetic data whose generation details are not supplied in the abstract and on standard machine-learning assumptions about feature informativeness and label quality.

axioms (1)

Domain assumption: Synthetic data accurately models real surveillance streams and epidemic labeling.
All reported performance numbers rest on this unverified modeling choice.

