Noise Injection: Improving Out-of-Distribution Generalization for Limited Size Datasets
Pith reviewed 2026-05-18 00:27 UTC · model grok-4.3
The pith
Injecting Gaussian, Speckle, Poisson or Salt-and-Pepper noise during training shrinks the ID-OOD performance gap in chest X-ray COVID detection from 0.10-0.20 down to 0.01-0.06.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) applied during training on limited-size chest X-ray datasets can significantly reduce the performance gap between in-distribution and out-of-distribution evaluation from 0.10-0.20 to 0.01-0.06 across AUC, F1, accuracy, recall and specificity for COVID-19 detection.
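As a reading aid, the gap quoted above can be sketched as a small computation. This is a minimal sketch, not the authors' evaluation code: it assumes the gap is the seed-averaged ID score minus the seed-averaged OOD score for each metric, and the names `METRICS` and `id_ood_gap` are hypothetical.

```python
from statistics import mean

# Metric names follow the list in the claim; the dictionary keys are an assumption.
METRICS = ("auc", "f1", "accuracy", "recall", "specificity")


def id_ood_gap(id_runs, ood_runs):
    """Return {metric: mean ID score - mean OOD score}.

    id_runs and ood_runs are lists with one dict per random seed,
    each mapping a metric name to a score in [0, 1].
    """
    return {
        m: mean(run[m] for run in id_runs) - mean(run[m] for run in ood_runs)
        for m in METRICS
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results from the paper.
    id_runs = [{m: 0.90 for m in METRICS}, {m: 0.92 for m in METRICS}]
    ood_runs = [{m: 0.88 for m in METRICS}, {m: 0.86 for m in METRICS}]
    for metric, gap in id_ood_gap(id_runs, ood_runs).items():
        print(f"{metric}: gap = {gap:.3f}")
```

Under this definition a positive value means the model scores higher on in-distribution data, and a value near zero means the two evaluations agree.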
What carries the argument
Noise injection of Gaussian, Speckle, Poisson, and Salt-and-Pepper types added to training images to disrupt learning of source-specific shortcuts.
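The four noise types can be illustrated with a short NumPy sketch. It assumes grayscale images stored as float arrays scaled to [0, 1]. The function names and every parameter value (`sigma`, `peak`, `amount`) are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np


def add_gaussian(img, rng, sigma=0.05):
    """Additive noise: x + n, with n drawn from N(0, sigma^2)."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


def add_speckle(img, rng, sigma=0.05):
    """Multiplicative noise: x + x * n, with n drawn from N(0, sigma^2)."""
    return np.clip(img + img * rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


def add_poisson(img, rng, peak=255.0):
    """Signal-dependent shot noise: Poisson counts with mean x * peak."""
    return np.clip(rng.poisson(img * peak) / peak, 0.0, 1.0)


def add_salt_and_pepper(img, rng, amount=0.02):
    """Set roughly `amount` of the pixels to 0 (pepper) or 1 (salt)."""
    out = img.copy()
    u = rng.random(img.shape)
    out[u < amount / 2] = 0.0        # pepper
    out[u > 1.0 - amount / 2] = 1.0  # salt
    return out


# Hypothetical registry so a training loop can pick a noise type by name.
NOISES = {
    "gaussian": add_gaussian,
    "speckle": add_speckle,
    "poisson": add_poisson,
    "salt_and_pepper": add_salt_and_pepper,
}


def inject_noise(img, kind, rng):
    """Apply one noise type to a single training image."""
    return NOISES[kind](img, rng)
```

In a training pipeline a function like `inject_noise` would typically be applied to training images only, with a seeded `np.random.default_rng(seed)` per run so that the ten-seed averaging stays reproducible. Evaluation images would be left untouched.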
Load-bearing premise
The observed gap reduction comes from reduced shortcut learning rather than from the noise introducing new dataset-specific effects that happen to help only on the tested distributions.
What would settle it
Measure whether the gap remains small when the same noise-trained models are evaluated on an entirely new clinical source never seen during any experiment.
Original abstract
Deep learned (DL) models for image recognition have been shown to fail to generalize to data from different devices, populations, etc. COVID-19 detection from Chest X-rays (CXRs), in particular, has been shown to fail to generalize to out-of-distribution (OOD) data from new clinical sources not covered in the training set. This occurs because models learn to exploit shortcuts - source-specific artifacts that do not translate to new distributions - rather than reasonable biomarkers to maximize performance on in-distribution (ID) data. Rendering the models more robust to distribution shifts, our study investigates the use of fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during training. Our empirical results demonstrate that this technique can significantly reduce the performance gap between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results averaged over ten random seeds across key metrics such as AUC, F1, accuracy, recall and specificity. Our source code is publicly available at https://github.com/Duongmai127/Noisy-ood
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that injecting fundamental noise types (Gaussian, Speckle, Poisson, Salt and Pepper) during training of deep learning models for COVID-19 detection on chest X-rays substantially improves out-of-distribution generalization on limited-size datasets. It reports that this reduces the ID-OOD performance gap from 0.10-0.20 down to 0.01-0.06 across AUC, F1, accuracy, recall and specificity, with all results averaged over ten random seeds; public code is provided.
Significance. If the empirical gap reduction is shown to be robust and mechanism-driven rather than dataset-specific, the work would supply a simple, low-overhead regularization strategy for mitigating shortcut learning in medical imaging tasks where training data are scarce and distribution shifts are common. The public code release aids reproducibility and could encourage follow-up studies.
major comments (3)
- [Abstract] The central gap-reduction claim (0.10-0.20 to 0.01-0.06) is presented without any description of the precise noise parameters (e.g., variance or intensity for each type), the base model architecture, or the exact train/ID/OOD splits; these details are load-bearing for determining whether the observed improvement is reproducible or an artifact of post-hoc tuning.
- [Experimental results] No ablation isolates noise injection from generic regularization (e.g., equivalent-strength dropout or data augmentation without noise), nor are feature-attribution or saliency analyses provided to test whether shortcuts are actually suppressed versus new noise-induced correlations being learned; this directly affects the interpretation that the gap closure stems from reduced source-specific artifacts.
- [Discussion or OOD evaluation] The OOD sets are drawn from other clinical sources of the same modality; without an external hold-out corpus from different hardware or acquisition protocols, it remains unclear whether the reported robustness would hold for truly unseen distributions, which is central to the generalization claim.
minor comments (2)
- [Abstract] The abstract refers to 'limited size datasets' but does not report the actual number of images or patients per split; adding these numbers would improve context.
- Consider adding a table that lists per-noise-type and per-metric ID and OOD scores (with standard deviations across the ten seeds) rather than only the aggregated gap ranges.
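The table requested in the last minor comment could be assembled along the following lines. This is a sketch using only the Python standard library. The input layout (a dict keyed by noise type, split, and metric, holding one score per seed) and the names `summarize` and `format_table` are assumptions made for illustration.

```python
from statistics import mean, stdev


def summarize(scores):
    """Collapse per-seed scores into (noise, split, metric, mean, std) rows.

    scores maps (noise, split, metric) to a list of per-seed values.
    Each list needs at least two values for the sample standard deviation.
    """
    return [
        (noise, split, metric, mean(vals), stdev(vals))
        for (noise, split, metric), vals in sorted(scores.items())
    ]


def format_table(rows):
    """Render the rows as a fixed-width plain-text table."""
    lines = [f"{'noise':<18}{'split':<7}{'metric':<13}{'mean':>7}{'std':>7}"]
    for noise, split, metric, mu, sd in rows:
        lines.append(f"{noise:<18}{split:<7}{metric:<13}{mu:>7.3f}{sd:>7.3f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results from the paper.
    demo = {
        ("gaussian", "ID", "auc"): [0.91, 0.93, 0.92],
        ("gaussian", "OOD", "auc"): [0.88, 0.90, 0.89],
    }
    print(format_table(summarize(demo)))
```

Reporting the standard deviation next to each mean lets a reader judge whether a gap of 0.01-0.06 is larger than the seed-to-seed variation.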
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address each of the major comments below and outline the changes we will make to improve the paper.
Point-by-point responses
- Referee: [Abstract] The central gap-reduction claim (0.10-0.20 to 0.01-0.06) is presented without any description of the precise noise parameters (e.g., variance or intensity for each type), the base model architecture, or the exact train/ID/OOD splits; these details are load-bearing for determining whether the observed improvement is reproducible or an artifact of post-hoc tuning.
Authors: We agree that including these details in the abstract would improve clarity and reproducibility. In the revised version, we will update the abstract to briefly specify the noise parameters used for each type, the base model architecture, and the dataset splits for train, ID, and OOD evaluations. These specifics are detailed in the Methods and Experimental Setup sections of the manuscript.
Revision: yes
- Referee: [Experimental results] No ablation isolates noise injection from generic regularization (e.g., equivalent-strength dropout or data augmentation without noise), nor are feature-attribution or saliency analyses provided to test whether shortcuts are actually suppressed versus new noise-induced correlations being learned; this directly affects the interpretation that the gap closure stems from reduced source-specific artifacts.
Authors: This is a valid point. To strengthen the evidence that noise injection specifically reduces shortcut learning, we will add ablation studies comparing our noise injection approach to standard regularization techniques such as dropout and basic data augmentation. Furthermore, we will incorporate saliency map analyses (e.g., using Grad-CAM) on both baseline and noise-injected models to visualize whether the learned features shift towards more robust biomarkers. These additions will be included in the revised Experimental Results section.
Revision: yes
- Referee: [Discussion or OOD evaluation] The OOD sets are drawn from other clinical sources of the same modality; without an external hold-out corpus from different hardware or acquisition protocols, it remains unclear whether the reported robustness would hold for truly unseen distributions, which is central to the generalization claim.
Authors: We acknowledge that the OOD datasets, while from different clinical sources, may share some similarities in acquisition protocols. In the revised manuscript, we will expand the Discussion section to explicitly discuss this limitation and emphasize that our results demonstrate robustness to shifts across different clinical sources. We will also outline plans for future validation on datasets with more divergent hardware characteristics to further test the generalization claims.
Revision: partial
Circularity Check
No circularity: purely empirical measurements with no derivation chain
Full rationale
This is an empirical study reporting measured performance gaps on ID vs OOD CXR datasets after applying standard noise injection during training. No equations, first-principles derivations, or predictions are claimed; results are presented as averaged experimental outcomes over 10 seeds. The central claim (gap reduction from 0.10-0.20 to 0.01-0.06) is a direct report of observed metrics rather than anything that reduces to fitted inputs or self-citations by construction. No load-bearing steps exist that could be circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard noise models (Gaussian, Speckle, Poisson, Salt-and-Pepper) can be applied directly to training images without altering the underlying data distribution in a way that harms OOD performance.