
arxiv: 2507.10492 · v1 · pith:4NUIVK67new · submitted 2025-07-14 · 💻 cs.CV · cs.AI · cs.LG

BenchReAD: A systematic benchmark for retinal anomaly detection

Pith reviewed 2026-05-21 23:17 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords retinal anomaly detection · medical image benchmark · disentangled representations · normal feature memory · generalization in anomaly detection · ocular disease screening

The pith

A new retinal anomaly benchmark shows that adding a normal feature memory to disentangled abnormality representations cuts performance drops on unseen cases and sets a new state of the art.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BenchReAD, a public benchmark with diverse retinal anomaly types and explicit generalization tests, to replace earlier overly simple and saturated evaluation setups. It benchmarks existing methods and shows that fully supervised disentangled representations of abnormalities deliver the strongest results overall yet still degrade sharply on certain unseen anomalies. The authors then combine those representations with a Normal Feature Memory module drawn from one-class learning ideas, creating NFM-DRA that preserves accuracy better across the new test splits. A sympathetic reader would care because retinal anomaly detection supports screening for both eye and systemic diseases, and reliable generalization to rare or novel anomalies matters for real clinical deployment.

Core claim

Retinal anomaly detection has been held back by limited anomaly variety, near-saturated test sets, and missing generalization checks in prior benchmarks. A fully supervised method using disentangled representations of abnormalities reaches the highest scores but shows large drops when tested on certain unseen anomalies. Adding a Normal Feature Memory bank to store and compare normal features mitigates those drops, producing NFM-DRA that establishes a new state of the art on the proposed benchmark.

What carries the argument

NFM-DRA, which augments disentangled representations of abnormalities with a Normal Feature Memory bank to retain performance on unseen retinal anomalies.
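
The mechanism can be illustrated with a minimal sketch. The only part taken from the paper is the fusion rule quoted alongside Figure 2 on this page: the final score is the average of a memory-based score g and the DRA score. Everything else is an assumption made for illustration: the function names, the toy feature vectors, the use of cosine distance, and the choice to define g as the mean distance to the b nearest stored normal features (the paper's Equation 2 is not reproduced on this page).

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def memory_score(feature: Vector, memory: List[Vector], b: int = 3) -> float:
    """Assumed form of g: mean distance to the b nearest stored normal features."""
    distances = sorted(cosine_distance(feature, m) for m in memory)
    nearest = distances[:b]
    return sum(nearest) / len(nearest)


def fused_score(feature: Vector, memory: List[Vector], dra_score: float, b: int = 3) -> float:
    """Average the memory-based score with the supervised DRA score (fusion rule from the paper)."""
    return 0.5 * (memory_score(feature, memory, b) + dra_score)


# Toy memory of "normal" features (hypothetical values, not from the paper).
normal_memory = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [1.0, 0.1]]

# Both inputs get the same low DRA score, mimicking a detector that misses an unseen anomaly.
normal_like = fused_score([0.98, 0.02], normal_memory, dra_score=0.1)
unseen_anomaly = fused_score([0.0, 1.0], normal_memory, dra_score=0.1)
```

In this toy case the feature far from the stored normal features ends up with the higher fused score even though its DRA score is low, which is the behaviour the memory is meant to add. A real implementation would also need the two scores to be on comparable scales before averaging.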

If this is right

  • Future retinal anomaly detectors can be evaluated more fairly on varied anomaly types and explicit generalization tasks.
  • Methods that store normal features alongside supervised abnormality modeling become a practical route to better robustness.
  • The public benchmark enables direct comparison of one-class, semi-supervised, and fully supervised approaches under the same conditions.
  • Improved handling of unseen anomalies supports more reliable screening for ocular and systemic diseases in varied patient populations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar memory-augmented disentanglement ideas could transfer to anomaly detection in other medical imaging domains such as chest X-rays or histopathology slides.
  • The benchmark design highlights the value of mixing labeled abnormal samples with unlabeled data, which is common in clinics but rarely used in current one-class setups.
  • If the Normal Feature Memory proves stable, it may reduce the need for constant model retraining when new rare anomalies appear in practice.

Load-bearing premise

The chosen anomaly categories and train-test splits in BenchReAD are representative of real clinical variability and hard enough to expose genuine generalization gaps rather than being too easy or artificial.

What would settle it

If NFM-DRA shows no meaningful reduction in performance drop compared with plain DRA when evaluated on the benchmark's held-out unseen anomaly subsets, the central improvement claim would be refuted.
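
As a rough sketch of how that check could be run, the snippet below computes AUROC directly from its rank definition and reports the gap between seen and unseen anomaly categories. The function names and all scores are hypothetical and chosen only for illustration; they are not results from the paper, and for brevity the same normal and seen scores are reused for both methods.

```python
from typing import Sequence


def auroc(normal_scores: Sequence[float], abnormal_scores: Sequence[float]) -> float:
    """Probability that a random abnormal sample outscores a random normal one (ties count half)."""
    wins = 0.0
    for a in abnormal_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(abnormal_scores) * len(normal_scores))


def generalization_drop(
    normal: Sequence[float],
    seen_abnormal: Sequence[float],
    unseen_abnormal: Sequence[float],
) -> float:
    """AUROC on seen anomaly categories minus AUROC on unseen ones."""
    return auroc(normal, seen_abnormal) - auroc(normal, unseen_abnormal)


# Hypothetical anomaly scores, for illustration only.
normal = [0.1, 0.2, 0.3, 0.4]
seen = [0.8, 0.9, 0.7, 0.95]
unseen_dra = [0.25, 0.5, 0.15, 0.6]   # scores from a plain DRA-style model
unseen_nfm = [0.45, 0.5, 0.35, 0.6]   # scores after adding a normal feature memory

drop_dra = generalization_drop(normal, seen, unseen_dra)
drop_nfm = generalization_drop(normal, seen, unseen_nfm)
print(f"drop without memory: {drop_dra:.4f}, with memory: {drop_nfm:.4f}")
```

The claim would be in trouble if, on the benchmark's real held-out subsets, the two drops were indistinguishable once seed-to-seed variance is taken into account.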

Figures

Figures reproduced from arXiv: 2507.10492 by Chenyu Lian, Hong-Yu Zhou, Jing Qin, Zhanli Hu.

Figure 1: (a) Comparison among widely used datasets for retinal anomaly detection (APTOS [15], LAG [19], OCT 2017 [16], and RESC [13]) and our BenchReAD. (b) Overview of the test datasets included in the proposed benchmark.
Surrounding text in the paper: "… lacks a comprehensive and systematic benchmark to standardize evaluation protocols and facilitate the development of novel methods. Existing benchmarks of medical anomaly detection related to …"
Figure 2: ROC curves of distinguishing normal samples against all abnormal ones on test sets. Corresponding AUCs (%) are marked alongside 95% confidence intervals.
Surrounding text in the paper: "… where $N_{\mathcal{M}}^{b}(m^{*})$ denotes the $b$ features in $\mathcal{M}$ that are nearest to $m^{*}$. We refine the prediction scores of DRA using anomaly values derived from Equation 2, generating the final anomaly scores: $f(X; \mathcal{M}, \mathrm{DRA}) = \frac{1}{2}\left[ g(X; \mathcal{M}) + \mathrm{DRA}(X) \right]$ (3). 3 Experiments and a…"
Figure 3: AUROCs with 95% confidence intervals for normal samples compared to each abnormal category on the (a) RIADD and (b) JSIEC datasets for fundus benchmarking, as well as (c) OCT test datasets. Seen and unseen annotations highlight whether the anomaly categories appear in the training set or not, respectively.
Surrounding text in the paper: "… PatchCore. The proposed NFM-DRA is the best-performing approach, showing notable improvements, partic…"
Original abstract

Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces BenchReAD, a comprehensive public benchmark for retinal anomaly detection that addresses prior limitations including overly simplistic anomaly sets, saturated test splits, and insufficient generalization evaluation. The authors categorize existing methods, report that a fully supervised disentangled representations of abnormalities (DRA) approach yields the strongest results yet exhibits notable performance degradation on certain unseen anomalies, and propose NFM-DRA which augments DRA with a Normal Feature Memory mechanism to recover performance and establish a new state-of-the-art on the benchmark.

Significance. If the benchmark splits prove representative and the reported gains hold under rigorous statistical scrutiny, this work supplies the first systematic public resource for retinal anomaly detection, enabling standardized comparisons across one-class, supervised, and hybrid paradigms. The explicit contrast to saturated prior setups and the public GitHub release are strengths that support reproducibility and incremental progress toward clinically robust detectors.

major comments (2)
  1. [§4.3 and Table 4] The reported performance recovery of NFM-DRA over DRA on unseen-anomaly splits is load-bearing for the central claim of a new SOTA; the manuscript must include per-split standard deviations across at least three random seeds and a statistical significance test (e.g., paired t-test) to confirm the improvement is not attributable to variance.
  2. [§3.2] The construction of the Normal Feature Memory is described at a high level; the precise memory-update rule, capacity, and distance metric used during inference must be specified with equations so that the incremental benefit over plain DRA can be reproduced and isolated.
minor comments (3)
  1. [Abstract] Quantitative metrics (e.g., AUC or F1 deltas) for the DRA drop and NFM-DRA recovery are omitted; adding the key headline numbers would make the contribution immediately verifiable.
  2. [§2 Related Work] The discussion of prior retinal benchmarks should explicitly cite the dataset sizes and anomaly counts of the three most-cited works to quantify the claimed improvement in diversity.
  3. [Figure 3] The legend and axis labels are too small for print; increasing font size and adding error bars would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of BenchReAD and the recommendation for minor revision. The two major comments identify important areas for strengthening statistical rigor and reproducibility; we will address both directly in the revised manuscript.

  1. Referee: [§4.3 and Table 4] The reported performance recovery of NFM-DRA over DRA on unseen-anomaly splits is load-bearing for the central claim of a new SOTA; the manuscript must include per-split standard deviations across at least three random seeds and a statistical significance test (e.g., paired t-test) to confirm the improvement is not attributable to variance.

    Authors: We agree that statistical validation is necessary to support the central claim. In the revised manuscript we will re-run all experiments on the unseen-anomaly splits for both DRA and NFM-DRA using three independent random seeds. Table 4 will be updated to report mean performance together with standard deviations, and we will add paired t-test p-values comparing NFM-DRA against DRA on each split. These additions will be placed in §4.3 and the corresponding table caption.
    revision: yes

  2. Referee: [§3.2] The construction of the Normal Feature Memory is described at a high level; the precise memory-update rule, capacity, and distance metric used during inference must be specified with equations so that the incremental benefit over plain DRA can be reproduced and isolated.

    Authors: We thank the referee for highlighting the need for precise specification. In the revised §3.2 we will add the exact memory-update equation, state the fixed memory capacity (number of stored normal feature vectors), and define the distance metric (cosine distance) employed at inference time. These details will be presented with numbered equations immediately following the high-level description, enabling direct reproduction and isolation of the NFM contribution.
    revision: yes
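
The seed-level comparison discussed in point 1 can be sketched as follows. This is a minimal illustration of a paired t-test computed by hand with the standard library; the function name and the per-seed AUROC values are hypothetical and are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-seed AUROCs (%) on one unseen-anomaly split.
nfm_dra = [86.0, 88.0, 87.0]
dra = [80.0, 81.0, 82.0]

t_stat, dof = paired_t_statistic(nfm_dra, dra)
print(f"NFM-DRA: {statistics.mean(nfm_dra):.1f} ± {statistics.stdev(nfm_dra):.1f}")
print(f"DRA:     {statistics.mean(dra):.1f} ± {statistics.stdev(dra):.1f}")
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
```

With only three seeds there are two degrees of freedom, so the two-sided 5% critical value is about 4.30 and the test has little power; more seeds would make the comparison more convincing. If SciPy is available, `scipy.stats.ttest_rel` returns the same statistic together with a p-value.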

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark and method proposal


The paper introduces a new public benchmark for retinal anomaly detection and reports empirical comparisons of methods, finding that a fully supervised DRA approach performs best yet degrades on unseen anomalies, then proposes NFM-DRA by integrating a Normal Feature Memory inspired by prior one-class memory-bank work. No mathematical derivations, first-principles predictions, or equations appear in the abstract or description. Claims rest on experimental results on the released benchmark rather than any reduction of outputs to fitted inputs or self-citations by construction. The benchmark release and explicit contrast to prior saturated setups provide independent verifiability, making the work self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that labeled abnormal retinal images are routinely available in clinical practice and that memory-bank mechanisms from one-class learning transfer usefully to a fully supervised disentangled-representation setting; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • [domain assumption] Labeled abnormal data and unlabeled data are commonly available in clinical retinal imaging practice.
    Invoked to justify moving beyond one-class supervised methods.



Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. [1] Bao, J., Sun, H., Deng, H., He, Y., Zhang, Z., Li, X.: Bmad: Benchmarks for medical anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4042–4053 (2024)

  2. [2] Cai, Y., Chen, H., Yang, X., Zhou, Y., Cheng, K.T.: Dual-distribution discrepancy with self-supervised refinement for anomaly detection in medical images. Medical Image Analysis 86, 102794 (2023)

  3. [3] Cai, Y., Zhang, W., Chen, H., Cheng, K.T.: Medianomaly: A comparative study of anomaly detection in medical images. arXiv preprint arXiv:2404.04518 (2024)

  4. [4] Cen, L.P., Ji, J., Lin, J.W., Ju, S.T., Lin, H.J., Li, T.P., Wang, Y., Yang, J.F., Liu, Y.F., Tan, S., et al.: Automatic detection of 39 fundus diseases and conditions in retinal photographs using deep neural networks. Nature communications 12(1), 4828 (2021)

  5. [5] Deng, H., Li, X.: Anomaly detection via reverse distillation from one-class embedding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9737–9746 (2022)

  6. [6] Ding, C., Pang, G., Shen, C.: Catching both gray and black swans: Open-set supervised anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7388–7398 (2022)

  7. [7] Gholami, P., Roy, P., Parthasarathy, M.K., Lakshminarayanan, V.: Octid: Optical coherence tomography image database. Computers & Electrical Engineering 81, 106532 (2020)

  8. [8] Gong, D., Liu, L., Le, V., Saha, B., Mansour, M.R., Venkatesh, S., Hengel, A.v.d.: Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1705–1714 (2019)

  9. [9] Guo, J., Jia, L., Zhang, W., Li, H., et al.: Recontrast: Domain-specific anomaly detection via contrastive reconstruction. Advances in Neural Information Processing Systems 36 (2024)

  10. [10] Guo, J., Lu, S., Jia, L., Zhang, W., Li, H.: Encoder-decoder contrast for unsupervised anomaly detection in medical images. IEEE Transactions on Medical Imaging (2023)

  11. [11] Han, X., Chen, X., Liu, L.P.: Gan ensemble for anomaly detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 4090–4097 (2021)

  12. [12] Han, Y., Li, W., Liu, M., Wu, Z., Zhang, F., Liu, X., Tao, L., Li, X., Guo, X.: Application of an anomaly detection model to screen for ocular diseases using color retinal fundus images: design and evaluation study. Journal of medical Internet research 23(7), e27822 (2021)

  13. [13] Hu, J., Chen, Y., Yi, Z.: Automated segmentation of macular edema in oct using deep neural networks. Medical image analysis 55, 216–227 (2019)

  14. [14] Jiang, X., Liu, J., Wang, J., Nie, Q., Wu, K., Liu, Y., Wang, C., Zheng, F.: Softpatch: Unsupervised anomaly detection with noisy data. Advances in Neural Information Processing Systems 35, 15433–15445 (2022)

  15. [15] Karthik, Maggie, Dane, S.: Aptos 2019 blindness detection. https://kaggle.com/competitions/aptos2019-blindness-detection (2019), Kaggle

  16. [16] Kermany, D.S., Goldbaum, M., Cai, W., Valentim, C.C., Liang, H., Baxter, S.L., McKeown, A., Yang, G., Wu, X., Yan, F., et al.: Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172(5), 1122–1131 (2018)

  17. [17] Kulyabin, M., Zhdanov, A., Nikiforova, A., Stepichev, A., Kuznetsova, A., Ronkin, M., Borisov, V., Bogachev, A., Korotkich, S., Constable, P.A., et al.: Octdl: Optical coherence tomography dataset for image-based deep learning methods. Scientific Data 11(1), 365 (2024)

  18. [18] Li, C.L., Sohn, K., Yoon, J., Pfister, T.: Cutpaste: Self-supervised learning for anomaly detection and localization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9664–9674 (2021)

  19. [19] Li, L., Xu, M., Wang, X., Jiang, L., Liu, H.: Attention based glaucoma detection: A large-scale database and cnn model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10571–10580 (2019)

  20. [20] Li, Y., Mitchell, W., Elze, T., Zebardast, N.: Association between diabetes, diabetic retinopathy, and glaucoma. Current Diabetes Reports 21, 1–16 (2021)

  21. [21] Liu, Z., Zhou, Y., Xu, Y., Wang, Z.: Simplenet: A simple network for image anomaly detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20402–20411 (2023)

  22. [22] Nakayama, L.F., Restrepo, D., Matos, J., Ribeiro, L.Z., Malerbi, F.K., Celi, L.A., et al.: Brset: A brazilian multilabel ophthalmological dataset of retina fundus photos. PLOS Digital Health 3(7), e0000454 (2024). https://doi.org/10.1371/journal.pdig.0000454

  23. [23] Narasimha-Iyer, H., Can, A., Roysam, B., Stewart, V., Tanenbaum, H.L., Majerovics, A., Singh, H.: Robust detection and classification of longitudinal changes in color retinal fundus images for monitoring diabetic retinopathy. IEEE transactions on biomedical engineering 53(6), 1084–1098 (2006)

  24. [24] Pachade, S., Porwal, P., Thulkar, D., Kokare, M., Deshmukh, G., Sahasrabuddhe, V., Giancardo, L., Quellec, G., Mériaudeau, F.: Retinal fundus multi-disease image dataset (rfmid): a dataset for multi-disease detection research. Data 6(2), 14 (2021)

  25. [25] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)

  26. [26] Roth, K., Pemula, L., Zepeda, J., Schölkopf, B., Brox, T., Gehler, P.: Towards total recall in industrial anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14318–14328 (2022)

  27. [27] Xia, X., Li, Y., Xiao, G., Zhan, K., Yan, J., Cai, C., Fang, Y., Huang, G.: Benchmarking deep models on retinal fundus disease diagnosis and a large-scale dataset. Signal Processing: Image Communication 127, 117151 (2024). https://doi.org/10.1016/j.image.2024.117151, https:...

  28. [28] Xie, G., Wang, J., Liu, J., Lyu, J., Liu, Y., Wang, C., Zheng, F., Jin, Y.: Im-iad: Industrial image anomaly detection benchmark in manufacturing. IEEE Transactions on Cybernetics (2024)

  29. [29] Zhai, M., Wu, X., He, Z., Wang, C., Wang, H., Wang, P.: Dual-branch retinal oct anomaly detection based on knowledge distillation and reconstruction. In: 2024 International Joint Conference on Neural Networks (IJCNN). pp. 1–8. IEEE (2024)