Beyond Accuracy: Evaluating Efficiency, Robustness and Explainability in Deep Learning for Malaria Diagnosis

Kerol Djoumessi; Olivier Kanamugire

arxiv: 2605.30734 · v1 · pith:WUOUK7GNnew · submitted 2026-05-29 · 💻 cs.LG · cs.CV

Pith reviewed 2026-06-28 23:24 UTC · model grok-4.3

keywords: malaria diagnosis, deep learning, explainable AI, model efficiency, robustness evaluation, image corruption, blood smear analysis, lightweight models

The pith

Lightweight deep learning models match heavier ones in malaria diagnosis accuracy on the NLM-Malaria dataset with no statistically significant differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks four deep learning models of different sizes and architectures for identifying malaria parasites in blood smear images. It finds that efficient lightweight models perform as well as heavier ones; a Friedman test shows no statistically significant performance gaps. CAM-based explanation methods reliably highlight relevant parasite regions, while finer-grained attribution techniques are less focused, especially on larger models. When images are corrupted in ways that mimic real conditions, model confidence falls faster than accuracy, and all tested explanation methods lose reliability at noise levels that could occur in practice.

Core claim

Lightweight, efficient-by-design models match their heavier counterparts in predictive performance on the NLM-Malaria dataset, with the Friedman test confirming no statistically significant differences. CAM-based XAI methods consistently localize diagnostically relevant regions, whereas fine-grained attribution methods produce less targeted explanations, particularly with heavier backbones. Robustness tests under three image corruption types show that model confidence degrades faster than accuracy, offering a potential signal for human review. However, no XAI method remains robust: explanation quality drops at corruption levels plausible in clinical settings, even when predictions stay accurate.
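
To make the statistical claim concrete, here is a minimal sketch of a Friedman test over per-fold scores, written with the Python standard library only. The fold accuracies are hypothetical placeholders, not the paper's numbers, and treating cross-validation folds as the blocks is an assumption; the names `average_ranks`, `chi2_sf`, and `friedman_test` are illustrative. In practice `scipy.stats.friedmanchisquare` is the usual library route.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


def friedman_test(scores):
    """scores[b][m] is the score of model m on block b (here: a CV fold)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in scores:
        for m, r in enumerate(average_ranks(row)):
            rank_sums[m] += r
        tie_term += sum(t ** 3 - t for t in Counter(row).values())
    stat = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    correction = 1 - tie_term / (n * k * (k * k - 1))
    if correction <= 0:  # every block is fully tied
        return 0.0, 1.0
    stat /= correction
    return stat, chi2_sf(stat, k - 1)


# Hypothetical per-fold accuracies for four models (placeholders only).
fold_accuracy = [
    [0.961, 0.958, 0.963, 0.960],
    [0.957, 0.962, 0.959, 0.961],
    [0.964, 0.960, 0.958, 0.962],
    [0.959, 0.963, 0.961, 0.957],
    [0.962, 0.957, 0.960, 0.964],
]
statistic, p_value = friedman_test(fold_accuracy)
print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

A large p-value here only means the test fails to detect a difference in ranks across blocks; with few folds the test has limited power, so it does not by itself establish equivalence.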

What carries the argument

Joint benchmarking of four models across efficiency, robustness to three image corruptions, and post-hoc explainability via CAM and fine-grained attribution methods.

If this is right

Lightweight models support deployment in resource-constrained settings without loss of predictive performance.
CAM-based methods are preferable over fine-grained attribution for producing targeted explanations of malaria predictions.
Monitoring model confidence can flag images for human review when corruption may be present.
Explanation reliability must be separately validated because it can fail while accuracy holds under realistic image noise.
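
One way to check explanation reliability separately from accuracy is to compare the saliency map of a clean image with that of its corrupted copy. The sketch below uses overlap of the top-scoring pixels (top-k IoU). This metric, the 20% fraction, the 0.5 cutoff, and the function names are assumptions for illustration; this page does not state which stability measure the paper uses.

```python
def top_k_indices(saliency, fraction=0.2):
    """Indices of the most salient pixels in a flattened saliency map."""
    k = max(1, int(round(fraction * len(saliency))))
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return set(order[:k])


def top_k_iou(clean_map, corrupted_map, fraction=0.2):
    """Intersection over union of the top-k pixel sets of two saliency maps."""
    a = top_k_indices(clean_map, fraction)
    b = top_k_indices(corrupted_map, fraction)
    return len(a & b) / len(a | b)


def explanation_drifted(pred_clean, pred_corrupted, iou, min_iou=0.5):
    """True when the prediction is unchanged but the explanation has moved."""
    return pred_clean == pred_corrupted and iou < min_iou


# Toy flattened saliency maps (hypothetical values).
clean = [0.9, 0.8, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 0.0]
corrupted = [0.2, 0.9, 0.8, 0.1, 0.0, 0.1, 0.3, 0.0, 0.1, 0.0]
iou = top_k_iou(clean, corrupted)
print(iou, explanation_drifted("infected", "infected", iou))
```

The `explanation_drifted` check isolates the case the list above warns about: the label stays the same while the highlighted region no longer matches.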

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clinical rollout would benefit from additional tests on locally collected images to verify the performance equivalence holds outside the NLM-Malaria set.
Pairing lightweight models with confidence thresholds could create a practical human-in-the-loop workflow for malaria screening.
Developing training techniques that preserve explanation quality under common degradations like blur or noise would strengthen reliability for real-world use.
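
The confidence-threshold workflow suggested above could look roughly like the following sketch. The 0.90 threshold, the label order, and the `triage` helper are hypothetical choices; a real threshold would need to be calibrated on held-out clinical data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits, threshold=0.90, labels=("uninfected", "infected")):
    """Return the predicted label and whether a human should review the image."""
    probs = softmax(logits)
    confidence = max(probs)
    return {
        "label": labels[probs.index(confidence)],
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }


print(triage([0.0, 5.0]))   # confident prediction
print(triage([0.2, 0.5]))   # low confidence, routed to a human
```

Softmax confidence is only a rough proxy for reliability, so this kind of gate is a screening aid rather than a guarantee.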

Load-bearing premise

The NLM-Malaria dataset together with the three chosen corruption types sufficiently represent the variability and noise encountered in actual clinical malaria diagnosis workflows in resource-constrained settings.
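
For orientation, the Figure 9 caption names the three corruptions as Gaussian noise, salt-and-pepper noise, and blur. The sketch below shows simple versions of each on a grayscale image stored as nested lists with values in [0, 1]. The severity parameters are hypothetical, and the box blur is a stand-in: the paper's actual blur kernel and noise levels are not given on this page.

```python
import random


def gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def salt_and_pepper(img, amount, rng):
    """Replace a fraction `amount` of pixels with pure white or pure black."""
    out = []
    for row in img:
        new_row = []
        for p in row:
            if rng.random() < amount:
                new_row.append(1.0 if rng.random() < 0.5 else 0.0)
            else:
                new_row.append(p)
        out.append(new_row)
    return out


def box_blur(img, radius):
    """Average each pixel with its neighbours inside a square window."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out


rng = random.Random(0)  # fixed seed so runs are repeatable
image = [[0.5] * 4 for _ in range(3)]
noisy = gaussian_noise(image, sigma=0.1, rng=rng)
speckled = salt_and_pepper(image, amount=0.05, rng=rng)
blurred = box_blur(image, radius=1)
```

Synthetic corruptions like these are easy to control, which is exactly why they may not cover staining or optics variation seen in the field.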

What would settle it

A new collection of blood-smear images from field clinics showing either a statistically significant accuracy gap favoring heavier models or stable XAI performance under the same corruption types.

Figures

Figures reproduced from arXiv: 2605.30734 by Kerol Djoumessi, Olivier Kanamugire.

**Figure 1.** Overview of the proposed framework. A blood smear image is processed by a DNN backbone for binary classification, followed by a post-hoc XAI method that generates saliency maps highlighting the image regions most influential to the prediction. The proposed framework combines a deep learning classification pipeline with post-hoc explainability…

**Figure 2.** Overview of the NLM-Malaria dataset. Representative examples of uninfected (left) and infected (center) blood smear cells, alongside the training and test set class distributions (right). The training set contains 27,560 images (13,780 infected and 13,780 uninfected), while the test set contains 15,832…

**Figure 3.** Accuracy and parameter count across architectures. Bars (y-axis) show mean cross-validated accuracy; the overlaid line (x-axis) indicates the number of trainable parameters. Computational costs. Beyond the accuracy–parameter trade-off, model efficiency is evaluated using floating-point operations (FLOPs), per-image inference latency (mean ± standard deviation over 500 runs), and peak CPU memory consumption…

**Figure 5.** Performance under different noise types across architec…

**Figure 4.** Example of an infected blood smear cell images across…

**Figure 6.** Saliency-based explanation maps across CNN architec…

**Figure 7.** Disagreement maps on MobileNet

**Figure 10.** ViT explanation using Grad-CAM. Section 4.6, Quantitative explanation analysis: To quantitatively evaluate explanation quality, insertion and deletion (Gomez et al., 2022) are adopted as complementary quantitative metrics. Insertion progressively reveals image regions in descending order of attributed importance and measures the corresponding increase in predicted class confidence, while deletion progressively…

**Figure 9.** Grad-CAM attribution maps for EfficientNet under increasing levels of perturbation. The first row represents the mild level, while the second row represents the severe level under various perturbations (Gaussian, Salt & Pepper, and Blur). Qualitative explanations for ViT. On the Vision Transformer model, Grad-CAM is adapted to analyze the spatial regions contributing to predictions…

**Figure 12.** Stability of explanations under perturbations.

**Figure 15.** MobileNet learning curves. A.4 ViT b 16

**Figure 16.** ViT learning curves

**Figure 17.** Disagreement maps across different model for…

**Figure 18.** Disagreement maps across different model for…

**Figure 19.** Mobile Model

**Figure 21.** EfficientNet Model

**Figure 22.** EfficientNet: insertion curves for different post-hoc XAI methods with random baseline.

**Figure 23.** ResNet: insertion curves for different post-hoc XAI methods with random baseline.

**Figure 24.** MobileNet: insertion curves for different post-hoc XAI methods with random baseline.
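
The insertion and deletion metrics described in the Figure 10 caption can be sketched as follows. This is a pixel-by-pixel toy version over a flattened image, with a zero baseline and a trapezoidal area under the curve; the baseline choice, the step size, and the toy model are assumptions, and real implementations typically reveal or remove pixels in batches.

```python
def perturbation_curve(model, image, saliency, mode="insertion", baseline=0.0):
    """Confidence after each pixel is revealed (insertion) or removed (deletion)."""
    order = sorted(range(len(image)), key=lambda i: saliency[i], reverse=True)
    if mode == "insertion":
        current = [baseline] * len(image)   # start from an empty image
    else:
        current = list(image)               # start from the full image
    scores = [model(current)]
    for idx in order:
        current[idx] = image[idx] if mode == "insertion" else baseline
        scores.append(model(current))
    return scores


def area_under_curve(scores):
    """Trapezoidal area with the x-axis normalised to [0, 1]."""
    steps = len(scores) - 1
    return sum((scores[i] + scores[i + 1]) / 2 for i in range(steps)) / steps


def toy_model(pixels):
    """Hypothetical classifier whose confidence depends only on pixels 0 and 2."""
    return min(1.0, (pixels[0] + pixels[2]) / 1.7)


image = [0.9, 0.1, 0.8, 0.0]
good_saliency = [1.0, 0.1, 0.9, 0.0]   # ranks the influential pixels first
poor_saliency = [0.0, 0.9, 0.1, 1.0]   # ranks them last

for name, sal in (("good", good_saliency), ("poor", poor_saliency)):
    ins = area_under_curve(perturbation_curve(toy_model, image, sal, "insertion"))
    dele = area_under_curve(perturbation_curve(toy_model, image, sal, "deletion"))
    print(f"{name}: insertion AUC = {ins:.3f}, deletion AUC = {dele:.3f}")
```

A faithful explanation should give a higher insertion area and a lower deletion area than an uninformative one, which is why the curves in Figures 22 to 24 are drawn against a random baseline.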

Original abstract

Malaria remains a leading cause of mortality in sub-Saharan Africa, where scarce diagnostic infrastructure makes timely, accurate diagnosis particularly challenging. While deep learning offers a compelling path toward automated malaria screening, clinical adoption is hindered by computational cost and opacity in decision-making. This work benchmarks four deep learning models spanning a wide range of design architectures and model capacities on the NLM-Malaria dataset, jointly evaluating predictive performance, robustness, and post-hoc explainability. We find that lightweight, efficient-by-design models match their heavier counterparts in predictive performance, and the Friedman test confirms no statistically significant performance differences. CAM-based XAI methods consistently localize diagnostically relevant regions, while fine-grained attribution methods produce less targeted explanations, particularly with heavier backbones. Robustness evaluation under three types of image corruption further reveals that model confidence degrades faster than accuracy, providing a practical signal for human review. However, no XAI method is robust to corruption, with explanation reliability degrading at noise levels plausible in clinical practice, even when predictions remain accurate. These findings support the deployment of lightweight architectures for malaria diagnosis in resource-constrained settings, while highlighting the vulnerability of post-hoc explanations as an important consideration for responsible clinical deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Lightweight models match heavy ones on NLM-Malaria with CAM XAI doing better than alternatives, but the three corruptions and single dataset leave real clinical transfer open.

The main point is that on this one dataset the smaller, efficient CNNs perform statistically indistinguishably from the larger ones by the Friedman test, CAM-based explanations localize the relevant cells more cleanly than finer attribution methods, and model confidence falls off faster than accuracy under the three corruptions while no explanation method stays stable. That last observation is the most practically useful piece.

The paper does a straightforward job of running the same models through accuracy, efficiency, robustness, and post-hoc XAI in one place. Sticking to off-the-shelf architectures and standard XAI tools keeps the comparison clean and lets the reader see the trade-offs directly. The confidence-versus-accuracy gap under corruption is worth noting for anyone thinking about human-in-the-loop screening.

The soft spot is exactly the one the stress-test note flags. NLM-Malaria plus three synthetic corruptions is a controlled benchmark; it does not automatically stand in for variable staining, microscope artifacts, lighting, or patient demographics in actual sub-Saharan clinics. Without any external clinical images or cross-site check, the equivalence of light and heavy models and the XAI degradation claims could be tied to this particular data distribution. The abstract states the conclusions cleanly, but the strength of the deployment recommendation hinges on how representative the setup really is.

This is for groups that select or adapt models for malaria screening in low-resource settings and want a multi-axis comparison as a reference. A reader already working on medical imaging robustness would get some concrete numbers to think with, though they would still run their own field tests.

It deserves peer review. The evaluation design is transparent enough that referees can check the corruption parameters, the exact XAI implementations, and the statistical test details. The topic has clear stakes even if the generalizability needs more support.

Referee Report

1 major / 2 minor

Summary. The paper benchmarks four deep learning models spanning a range of architectures and capacities on the NLM-Malaria dataset for automated malaria diagnosis. It jointly assesses predictive performance (accuracy and Friedman test for statistical significance), robustness under three types of image corruption, and post-hoc explainability via CAM-based methods versus fine-grained attribution techniques. Central claims are that lightweight efficient-by-design models match heavier counterparts with no statistically significant performance differences, CAM methods consistently localize diagnostically relevant regions while fine-grained methods are less targeted (especially on heavier backbones), model confidence degrades faster than accuracy under corruption (providing a human-review signal), and no XAI method remains robust to corruption even at noise levels where predictions stay accurate. These results are invoked to support deployment of lightweight models in resource-constrained settings while flagging XAI vulnerabilities for responsible clinical use.

Significance. If the empirical findings hold under broader validation, the work is significant for medical imaging AI by demonstrating that efficiency need not trade off against accuracy on this task and by providing a practical robustness signal via confidence. The joint evaluation of efficiency, robustness, and explainability addresses a real gap in clinical translation literature. Explicit use of the Friedman test and the observation that confidence drops precede accuracy loss are concrete strengths that could inform deployment protocols. The paper does not ship machine-checked proofs or parameter-free derivations, but the multi-metric benchmark itself is a useful contribution if the dataset and corruptions prove representative.

major comments (1)

[Abstract] Abstract and deployment recommendation: the claim that results 'support the deployment of lightweight architectures for malaria diagnosis in resource-constrained settings' is load-bearing for the paper's applied conclusion, yet rests on NLM-Malaria plus three unspecified corruptions being representative of field conditions (variable staining, microscope artifacts, lighting, demographics in sub-Saharan Africa). No external validation set or comparison to real clinical images from target settings is described, so equivalence, robustness ordering, and XAI degradation findings could be benchmark-specific artifacts.

minor comments (2)

[Abstract] The abstract refers to 'three types of image corruption' without naming them or their parameters; this detail should appear in the abstract or first paragraph of the methods section for immediate clarity.
The description of CAM versus fine-grained attribution would benefit from a brief statement of the exact methods and backbones used (e.g., Grad-CAM on ResNet vs. MobileNet) rather than generic labels.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback, particularly on the generalizability of our deployment claims. We address the major comment below.

Referee: [Abstract] Abstract and deployment recommendation: the claim that results 'support the deployment of lightweight architectures for malaria diagnosis in resource-constrained settings' is load-bearing for the paper's applied conclusion, yet rests on NLM-Malaria plus three unspecified corruptions being representative of field conditions (variable staining, microscope artifacts, lighting, demographics in sub-Saharan Africa). No external validation set or comparison to real clinical images from target settings is described, so equivalence, robustness ordering, and XAI degradation findings could be benchmark-specific artifacts.

Authors: We agree this is a substantive limitation. The NLM-Malaria dataset is a standard but controlled benchmark from a single source and does not capture the full spectrum of real-world clinical variability (e.g., staining inconsistencies, microscope artifacts, lighting conditions, or demographic differences across sub-Saharan Africa). Our three synthetic corruption types are approximations rather than direct matches to field data, and no external validation on target clinical images was performed. Thus, the observed model equivalence, robustness orderings, and XAI degradation patterns could indeed be benchmark-specific. We cannot claim direct support for deployment without further evidence. In revision we will (1) qualify the abstract claim to 'These findings indicate the potential suitability of lightweight architectures for malaria diagnosis in resource-constrained settings, subject to additional validation on diverse clinical data' and (2) add an explicit limitations paragraph in the discussion reiterating the benchmark-specific nature of the results and calling for external validation. This constitutes a partial revision, as new external datasets cannot be introduced without additional data collection.

Revision: partial

Circularity Check

0 steps flagged

Purely empirical benchmarking with no derivations or self-referential reductions

The paper conducts direct experimental comparisons of four DL models on the NLM-Malaria dataset, measuring accuracy, Friedman-test significance, robustness to three image corruptions, and post-hoc XAI localization quality. No equations, parameter fits, or predictive derivations appear; claims follow immediately from reported metrics without reduction to prior results by construction. No self-citations serve as load-bearing uniqueness theorems or ansatzes. The evaluation chain is self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no mathematical model, derivations, or parameter fitting described.

pith-pipeline@v0.9.1-grok · 5744 in / 1058 out tokens · 19094 ms · 2026-06-28T23:24:48.866627+00:00 · methodology

Reference graph

Works this paper leans on

7 extracted references · 5 canonical work pages · 4 internal anchors

[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[2] Salvador García, Alberto Fernández, Julián Luengo, and Francisco Herrera. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10):2044–2064.

[3] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.

[4] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

[5] Cuong Manh Nguyen and Truong-Son Hy. Efficient deep learning for medical imaging: Bridging the gap between high-performance AI and clinical deployment. arXiv preprint arXiv:2602.00910.

[6] World Health Organization. Global technical strategy for malaria 2016-2030. World Health Organization, 2016.

[7] Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296.
