Noise Injection: Improving Out-of-Distribution Generalization for Limited Size Datasets

Duong Mai; Lawrence Hall

arxiv: 2511.03855 · v2 · submitted 2025-11-05 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-18 00:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords noise injection · out-of-distribution generalization · chest X-ray · COVID-19 detection · shortcut learning · medical imaging · robustness

The pith

Injecting Gaussian, Speckle, Poisson or Salt-and-Pepper noise during training shrinks the ID-OOD performance gap in chest X-ray COVID detection from 0.10-0.20 down to 0.01-0.06.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep models for COVID-19 detection from chest X-rays often exploit source-specific artifacts that fail on images from new hospitals or scanners. The paper tests whether adding four standard noise types to training images can force the models to rely on more stable disease markers instead. Experiments averaged over ten random seeds show the ID-OOD gap across AUC, F1, accuracy, recall and specificity falls sharply when noise is injected. The approach requires no extra data and works on limited-size datasets. Results suggest a lightweight way to improve robustness without architectural changes.

Core claim

The central claim is that fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) applied during training on limited-size chest X-ray datasets can significantly reduce the performance gap between in-distribution and out-of-distribution evaluation from 0.10-0.20 to 0.01-0.06 across AUC, F1, accuracy, recall and specificity for COVID-19 detection.
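To make the "performance gap" concrete, here is a minimal sketch of how an ID-OOD gap could be computed for the threshold-based metrics named above. The helper names (`binary_metrics`, `id_ood_gap`) and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper. AUC is left out because it requires ranked scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, recall, specificity and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    at least one positive prediction was made).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


def id_ood_gap(id_metrics, ood_metrics):
    """Per-metric gap, defined here as the ID score minus the OOD score."""
    return {name: id_metrics[name] - ood_metrics[name] for name in id_metrics}


# Illustrative counts only (NOT results from the paper).
id_scores = binary_metrics(tp=90, fp=10, tn=90, fn=10)
ood_scores = binary_metrics(tp=70, fp=30, tn=80, fn=20)
gap = id_ood_gap(id_scores, ood_scores)
```

With these made-up counts the accuracy gap is 0.90 - 0.75 = 0.15, which would fall inside the 0.10-0.20 range the paper reports for models trained without noise.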

What carries the argument

Gaussian, Speckle, Poisson, and Salt-and-Pepper noise is added to training images to disrupt the learning of source-specific shortcuts.
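A minimal sketch of the four noise types, assuming grayscale images stored as NumPy float arrays scaled to [0, 1]. The function names and every default parameter (`sigma`, `peak`, `amount`) are illustrative assumptions: the noise intensities used in the paper are not stated in this summary, and the authors' repository may implement these steps differently.

```python
import numpy as np


def add_gaussian(img, rng, sigma=0.05):
    """Additive zero-mean Gaussian noise (sigma is a hypothetical default)."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


def add_speckle(img, rng, sigma=0.05):
    """Multiplicative noise: img + img * n, with n ~ N(0, sigma^2)."""
    return np.clip(img + img * rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


def add_poisson(img, rng, peak=255.0):
    """Shot noise: scaled intensities act as expected photon counts."""
    return np.clip(rng.poisson(img * peak) / peak, 0.0, 1.0)


def add_salt_and_pepper(img, rng, amount=0.02):
    """Set about `amount` of the pixels to 0 (pepper) or 1 (salt), half each."""
    out = img.copy()
    u = rng.random(img.shape)
    out[u < amount / 2] = 0.0
    out[u > 1.0 - amount / 2] = 1.0
    return out


NOISE_FUNCS = {
    "gaussian": add_gaussian,
    "speckle": add_speckle,
    "poisson": add_poisson,
    "salt_and_pepper": add_salt_and_pepper,
}


def inject_noise(img, kind, rng):
    """Apply one named noise type to a [0, 1] image and return a new array."""
    return NOISE_FUNCS[kind](img, rng)
```

In a training loop, one plausible use is to call `inject_noise` on each training image before it is passed to the model while leaving the ID and OOD evaluation images untouched. Passing an explicit `np.random.default_rng(seed)` generator keeps the noise reproducible across repeated runs.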

Load-bearing premise

The observed gap reduction comes from reduced shortcut learning rather than from the noise introducing new dataset-specific effects that happen to help only on the tested distributions.

What would settle it

Measure whether the gap remains small when the same noise-trained models are evaluated on an entirely new clinical source never seen during any experiment.

Original abstract

Deep learned (DL) models for image recognition have been shown to fail to generalize to data from different devices, populations, etc. COVID-19 detection from Chest X-rays (CXRs), in particular, has been shown to fail to generalize to out-of-distribution (OOD) data from new clinical sources not covered in the training set. This occurs because models learn to exploit shortcuts - source-specific artifacts that do not translate to new distributions - rather than reasonable biomarkers to maximize performance on in-distribution (ID) data. Rendering the models more robust to distribution shifts, our study investigates the use of fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during training. Our empirical results demonstrate that this technique can significantly reduce the performance gap between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results averaged over ten random seeds across key metrics such as AUC, F1, accuracy, recall and specificity. Our source code is publicly available at https://github.com/Duongmai127/Noisy-ood

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that injecting fundamental noise types (Gaussian, Speckle, Poisson, Salt and Pepper) during training of deep learning models for COVID-19 detection on chest X-rays substantially improves out-of-distribution generalization on limited-size datasets. It reports that this reduces the ID-OOD performance gap from 0.10-0.20 down to 0.01-0.06 across AUC, F1, accuracy, recall and specificity, with all results averaged over ten random seeds; public code is provided.

Significance. If the empirical gap reduction is shown to be robust and mechanism-driven rather than dataset-specific, the work would supply a simple, low-overhead regularization strategy for mitigating shortcut learning in medical imaging tasks where training data are scarce and distribution shifts are common. The public code release aids reproducibility and could encourage follow-up studies.

major comments (3)

[Abstract] Abstract: the central gap-reduction claim (0.10-0.20 to 0.01-0.06) is presented without any description of the precise noise parameters (e.g., variance or intensity for each type), the base model architecture, or the exact train/ID/OOD splits; these details are load-bearing for determining whether the observed improvement is reproducible or an artifact of post-hoc tuning.
[Experimental results] Experimental results section: no ablation isolates noise injection from generic regularization (e.g., equivalent-strength dropout or data augmentation without noise), nor are feature-attribution or saliency analyses provided to test whether shortcuts are actually suppressed versus new noise-induced correlations being learned; this directly affects the interpretation that the gap closure stems from reduced source-specific artifacts.
[Discussion] Discussion or OOD evaluation: the OOD sets are drawn from other clinical sources of the same modality; without an external hold-out corpus from different hardware or acquisition protocols, it remains unclear whether the reported robustness would hold for truly unseen distributions, which is central to the generalization claim.

minor comments (2)

[Abstract] The abstract refers to 'limited size datasets' but does not report the actual number of images or patients per split; adding these numbers would improve context.
Consider adding a table that lists per-noise-type and per-metric ID and OOD scores (with standard deviations across the ten seeds) rather than only the aggregated gap ranges.
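The table suggested in the last comment could be assembled along the following lines. This is a sketch under stated assumptions: the `per_seed` layout, the function names, and the three-seed placeholder values are hypothetical and are not data from the paper, which averages over ten seeds.

```python
import statistics


def summarize(per_seed):
    """Mean and sample std of ID, OOD and gap scores across seeds.

    `per_seed` maps (noise_type, metric) to a pair (id_scores, ood_scores),
    where each list holds one value per random seed, in the same seed order.
    """
    rows = []
    for (noise, metric), (id_scores, ood_scores) in sorted(per_seed.items()):
        gaps = [i - o for i, o in zip(id_scores, ood_scores)]
        rows.append({
            "noise": noise,
            "metric": metric,
            "id_mean": statistics.mean(id_scores),
            "id_std": statistics.stdev(id_scores),
            "ood_mean": statistics.mean(ood_scores),
            "ood_std": statistics.stdev(ood_scores),
            "gap_mean": statistics.mean(gaps),
            "gap_std": statistics.stdev(gaps),
        })
    return rows


def format_table(rows):
    """Render the summary rows as aligned plain-text lines."""
    lines = [f"{'noise':<18}{'metric':<14}{'ID':<18}{'OOD':<18}{'gap':<18}"]
    for r in rows:
        id_cell = f"{r['id_mean']:.3f} +/- {r['id_std']:.3f}"
        ood_cell = f"{r['ood_mean']:.3f} +/- {r['ood_std']:.3f}"
        gap_cell = f"{r['gap_mean']:.3f} +/- {r['gap_std']:.3f}"
        lines.append(
            f"{r['noise']:<18}{r['metric']:<14}{id_cell:<18}{ood_cell:<18}{gap_cell:<18}"
        )
    return "\n".join(lines)


# Placeholder scores for three seeds (NOT results from the paper).
example = {
    ("gaussian", "auc"): ([0.90, 0.92, 0.94], [0.88, 0.90, 0.92]),
}
rows = summarize(example)
table = format_table(rows)
```

Reporting the standard deviation of the per-seed gap, rather than only the gap between the two means, shows whether the gap reduction is consistent across seeds or driven by a few runs.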

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each of the major comments below and outline the changes we will make to improve the paper.

Referee: [Abstract] Abstract: the central gap-reduction claim (0.10-0.20 to 0.01-0.06) is presented without any description of the precise noise parameters (e.g., variance or intensity for each type), the base model architecture, or the exact train/ID/OOD splits; these details are load-bearing for determining whether the observed improvement is reproducible or an artifact of post-hoc tuning.

Authors: We agree that including these details in the abstract would improve clarity and reproducibility. In the revised version, we will update the abstract to briefly specify the noise parameters used for each type, the base model architecture, and the dataset splits for train, ID, and OOD evaluations. These specifics are detailed in the Methods and Experimental Setup sections of the manuscript.

revision: yes

Referee: [Experimental results] Experimental results section: no ablation isolates noise injection from generic regularization (e.g., equivalent-strength dropout or data augmentation without noise), nor are feature-attribution or saliency analyses provided to test whether shortcuts are actually suppressed versus new noise-induced correlations being learned; this directly affects the interpretation that the gap closure stems from reduced source-specific artifacts.

Authors: This is a valid point. To strengthen the evidence that noise injection specifically reduces shortcut learning, we will add ablation studies comparing our noise injection approach to standard regularization techniques such as dropout and basic data augmentation. Furthermore, we will incorporate saliency map analyses (e.g., using Grad-CAM) on both baseline and noise-injected models to visualize whether the learned features shift towards more robust biomarkers. These additions will be included in the revised Experimental Results section.

revision: yes

Referee: [Discussion] Discussion or OOD evaluation: the OOD sets are drawn from other clinical sources of the same modality; without an external hold-out corpus from different hardware or acquisition protocols, it remains unclear whether the reported robustness would hold for truly unseen distributions, which is central to the generalization claim.

Authors: We acknowledge that the OOD datasets, while from different clinical sources, may share some similarities in acquisition protocols. In the revised manuscript, we will expand the Discussion section to explicitly discuss this limitation and emphasize that our results demonstrate robustness to shifts across different clinical sources. We will also outline plans for future validation on datasets with more divergent hardware characteristics to further test the generalization claims.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivation chain

This is an empirical study reporting measured performance gaps on ID vs OOD CXR datasets after applying standard noise injection during training. No equations, first-principles derivations, or predictions are claimed; results are presented as averaged experimental outcomes over 10 seeds. The central claim (gap reduction from 0.10-0.20 to 0.01-0.06) is a direct report of observed metrics rather than anything that reduces to fitted inputs or self-citations by construction. No load-bearing steps exist that could be circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that standard noise distributions act as effective regularizers against shortcut learning without requiring new theoretical justification or additional fitted parameters beyond the noise type and intensity choices.

axioms (1)

Domain assumption: Standard noise models (Gaussian, Speckle, Poisson, Salt-and-Pepper) can be applied directly to training images without altering the underlying data distribution in a way that harms OOD performance.
Invoked implicitly when claiming that the gap reduction is due to reduced shortcut exploitation.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Hong, Z., Yue, Y., Chen, Y., Cong, L., Lin, H., Luo, Y., Wang, M. H., Wang, W., Xu, J., Yang, X., et al., “Out-of-distribution detection in medical image analysis: A survey,” arXiv preprint arXiv:2404.18279 (2024)

[2] López-Cabrera, J. D., Orozco-Morales, R., Portal-Díaz, J. A., Lovelle-Enríquez, O., and Pérez-Díaz, M., “Current limitations to identify covid-19 using artificial intelligence with chest x-ray imaging (part ii). the shortcut learning problem,” Health and Technology 11(6), 1331–1345 (2021)

[3] Ahmed, K. B., Goldgof, G. M., Paul, R., Goldgof, D. B., and Hall, L. O., “Discovery of a generalization gap of convolutional neural networks on covid-19 x-rays classification,” IEEE Access 9, 72970–72979 (2021)

[4] Momeny, M., Neshat, A. A., Hussain, M. A., Kia, S., Marhamati, M., Jahanbakhshi, A., and Hamarneh, G., “Learning-to-augment strategy using noisy and denoised data: Improving generalizability of deep cnn for the detection of covid-19 in x-ray images,” Computers in Biology and Medicine 136, 104704 (2021)

[5] Akbiyik, M. E., “Data augmentation in training cnns: injecting noise to images,” arXiv preprint arXiv:2307.06855 (2023)

[6] Gutbrod, M., Rauber, D., Nunes, D. W., and Palm, C., “Openmibood: Open medical imaging benchmarks for out-of-distribution detection,” in [Proceedings of the Computer Vision and Pattern Recognition Conference], 25874–25886 (2025)

[7] Vayá, M. D. L. I., Saborit, J. M., Montell, J. A., Pertusa, A., Bustos, A., Cazorla, M., Galant, J., Barber, X., Orozco-Beltrán, D., García-García, F., et al., “BIMCV COVID-19+: A large annotated dataset of RX and CT images from COVID-19 patients,” arXiv preprint arXiv:2006.01174 (2020)

[8] Bustos, A., Pertusa, A., Salinas, J.-M., and De La Iglesia-Vaya, M., “Padchest: A large chest x-ray image dataset with multi-label annotated reports,” Medical Image Analysis 66, 101797 (2020)

[9] Desai, S., Baghal, A., Wongsurawat, T., Jenjaroenpun, P., Powell, T., Al-Shukri, S., Gates, K., Farmer, P., Rutherford, M., Blake, G., et al., “Chest imaging representing a covid-19 positive rural us population,” Scientific Data 7(1), 414 (2020)

[10] Winther, H. B., Laser, H., Gerbel, S., Maschke, S. K., Hinrichs, J. B., Vogel-Claussen, J., Wacker, F. K., Höper, M. M., and Meyer, B. C., “COVID-19 Image Repository,” (5 2020)

[11] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., and Summers, R. M., “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in [Proceedings of the IEEE conference on computer vision and pattern recognition], 2097–2106 (2017)

[12] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al., “Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” in [Proceedings of the AAAI conference on artificial intelligence], 33(01), 590–597 (2019)

[13] Gaggion, N., Mansilla, L., Mosquera, C., Milone, D. H., and Ferrante, E., “Improving anatomical plausibility in medical image segmentation via hybrid graph neural networks: applications to chest x-ray analysis,” IEEE Transactions on Medical Imaging 42(2), 546–556 (2022)
