BenchReAD: A systematic benchmark for retinal anomaly detection
Pith reviewed 2026-05-21 23:17 UTC · model grok-4.3
The pith
A new retinal anomaly benchmark shows that adding a normal feature memory to disentangled abnormality representations cuts performance drops on unseen cases and sets a new state of the art.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Retinal anomaly detection has been held back by limited anomaly variety, near-saturated test sets, and missing generalization checks in prior benchmarks. A fully supervised method using disentangled representations of abnormalities (DRA) reaches the highest scores but shows large drops when tested on certain unseen anomalies. Adding a Normal Feature Memory bank, which stores normal features for comparison, mitigates those drops; the resulting method, NFM-DRA, establishes a new state of the art on the proposed benchmark.
What carries the argument
NFM-DRA, which augments disentangled representations of abnormalities with a Normal Feature Memory bank to retain performance on unseen retinal anomalies.
If this is right
- Future retinal anomaly detectors can be evaluated more fairly on varied anomaly types and explicit generalization tasks.
- Methods that store normal features alongside supervised abnormality modeling become a practical route to better robustness.
- The public benchmark enables direct comparison of one-class, semi-supervised, and fully supervised approaches under the same conditions.
- Improved handling of unseen anomalies supports more reliable screening for ocular and systemic diseases in varied patient populations.
Where Pith is reading between the lines
- Similar memory-augmented disentanglement ideas could transfer to anomaly detection in other medical imaging domains such as chest X-rays or histopathology slides.
- The benchmark design highlights the value of mixing labeled abnormal samples with unlabeled data, which is common in clinics but rarely used in current one-class setups.
- If the Normal Feature Memory proves stable, it may reduce the need for constant model retraining when new rare anomalies appear in practice.
Load-bearing premise
The chosen anomaly categories and train-test splits in BenchReAD are representative of real clinical variability and hard enough to expose genuine generalization gaps rather than being too easy or artificial.
What would settle it
If NFM-DRA shows no meaningful reduction in performance drop compared with plain DRA when evaluated on the benchmark's held-out unseen anomaly subsets, the central improvement claim would be refuted.
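The comparison described here can be sketched in a few lines. This is only an illustration under stated assumptions: it takes AUROC as the metric (this page does not say which metric the paper reports), the helper names `auroc` and `performance_drop` are hypothetical, and the scores are made-up values, not results from BenchReAD.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random anomaly (label 1) scores above a random
    normal sample (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def performance_drop(seen_auroc: float, unseen_auroc: float) -> float:
    """Drop in AUROC when moving from seen to unseen anomaly types."""
    return seen_auroc - unseen_auroc


# Hypothetical anomaly scores for illustration only (NOT benchmark results).
labels = [0, 0, 1, 1]
seen_scores = [0.1, 0.2, 0.8, 0.9]    # anomalies of types seen in training
unseen_scores = [0.1, 0.6, 0.4, 0.9]  # anomalies of a held-out type

drop = performance_drop(auroc(seen_scores, labels), auroc(unseen_scores, labels))
print(f"AUROC drop on unseen anomalies: {drop:.2f}")
```

Computing this drop for both DRA and NFM-DRA on the same held-out subsets, and comparing the two values, is one way to phrase the test: if the two drops are about equal, the memory bank is not helping.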
Original abstract
Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BenchReAD, a comprehensive public benchmark for retinal anomaly detection that addresses prior limitations including overly simplistic anomaly sets, saturated test splits, and insufficient generalization evaluation. The authors categorize existing methods, report that a fully supervised disentangled representations of abnormalities (DRA) approach yields the strongest results yet exhibits notable performance degradation on certain unseen anomalies, and propose NFM-DRA which augments DRA with a Normal Feature Memory mechanism to recover performance and establish a new state-of-the-art on the benchmark.
Significance. If the benchmark splits prove representative and the reported gains hold under rigorous statistical scrutiny, this work supplies the first systematic public resource for retinal anomaly detection, enabling standardized comparisons across one-class, supervised, and hybrid paradigms. The explicit contrast to saturated prior setups and the public GitHub release are strengths that support reproducibility and incremental progress toward clinically robust detectors.
major comments (2)
- [§4.3 and Table 4] The reported performance recovery of NFM-DRA over DRA on unseen-anomaly splits is load-bearing for the central claim of a new SOTA; the manuscript must include per-split standard deviations across at least three random seeds and a statistical significance test (e.g., paired t-test) to confirm the improvement is not attributable to variance.
- [§3.2] The construction of the Normal Feature Memory is described at a high level; the precise memory-update rule, capacity, and distance metric used during inference must be specified with equations so that the incremental benefit over plain DRA can be reproduced and isolated.
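The statistical check requested in the first major comment can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `paired_t_test` is hypothetical, and the per-seed AUROC values are invented for the example. The p-value uses the closed-form CDF of Student's t distribution with 2 degrees of freedom, so it is only returned for exactly three paired runs; for other seed counts a statistics library would be needed.

```python
import math
import statistics
from typing import Optional, Sequence, Tuple


def paired_t_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, Optional[float]]:
    """Paired t statistic for b versus a, plus a two-sided p-value when n == 3."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [y - x for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    p = None
    if len(diffs) == 3:
        # For df = 2 the t CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)),
        # so the two-sided p-value is 1 - |t| / sqrt(t^2 + 2).
        p = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    return t, p


# Hypothetical per-seed AUROC on one unseen-anomaly split (NOT paper results).
dra_auroc = [0.80, 0.82, 0.81]
nfm_dra_auroc = [0.85, 0.86, 0.87]

t_stat, p_value = paired_t_test(dra_auroc, nfm_dra_auroc)
print(f"DRA:     {statistics.mean(dra_auroc):.3f} +/- {statistics.stdev(dra_auroc):.3f}")
print(f"NFM-DRA: {statistics.mean(nfm_dra_auroc):.3f} +/- {statistics.stdev(nfm_dra_auroc):.3f}")
print(f"paired t = {t_stat:.2f}, two-sided p = {p_value:.4f}")
```

With only three seeds the test has two degrees of freedom and little power, so a small p-value requires the per-seed differences to be very consistent; reporting the mean and standard deviation alongside it, as the comment asks, keeps that limitation visible.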
minor comments (3)
- [Abstract] Quantitative metrics (e.g., AUC or F1 deltas) for the DRA drop and NFM-DRA recovery are omitted; adding the key headline numbers would make the contribution immediately verifiable.
- [§2 Related Work] The discussion of prior retinal benchmarks should explicitly cite the dataset sizes and anomaly counts of the three most-cited works to quantify the claimed improvement in diversity.
- [Figure 3] The legend and axis labels are too small for print; increasing font size and adding error bars would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of BenchReAD and the recommendation for minor revision. The two major comments identify important areas for strengthening statistical rigor and reproducibility; we will address both directly in the revised manuscript.
Point-by-point responses
- Referee: [§4.3 and Table 4] The reported performance recovery of NFM-DRA over DRA on unseen-anomaly splits is load-bearing for the central claim of a new SOTA; the manuscript must include per-split standard deviations across at least three random seeds and a statistical significance test (e.g., paired t-test) to confirm the improvement is not attributable to variance.
  Authors: We agree that statistical validation is necessary to support the central claim. In the revised manuscript we will re-run all experiments on the unseen-anomaly splits for both DRA and NFM-DRA using three independent random seeds. Table 4 will be updated to report mean performance together with standard deviations, and we will add paired t-test p-values comparing NFM-DRA against DRA on each split. These additions will be placed in §4.3 and the corresponding table caption.
  Revision: yes
- Referee: [§3.2] The construction of the Normal Feature Memory is described at a high level; the precise memory-update rule, capacity, and distance metric used during inference must be specified with equations so that the incremental benefit over plain DRA can be reproduced and isolated.
  Authors: We thank the referee for highlighting the need for precise specification. In the revised §3.2 we will add the exact memory-update equation, state the fixed memory capacity (number of stored normal feature vectors), and define the distance metric (cosine distance) employed at inference time. These details will be presented with numbered equations immediately following the high-level description, enabling direct reproduction and isolation of the NFM contribution.
  Revision: yes
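To make the three quantities named in this exchange concrete (update rule, capacity, distance metric), here is a minimal sketch of a fixed-capacity memory of normal feature vectors scored by cosine distance. Only the fixed capacity and the cosine distance come from the rebuttal text. The class name `NormalFeatureMemory`, the first-in-first-out eviction rule, and the nearest-neighbor scoring are assumptions made for illustration; the paper's actual update rule and the way the memory is combined with DRA are not given on this page.

```python
import math
from collections import deque
from typing import Deque, List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


class NormalFeatureMemory:
    """Hypothetical fixed-capacity bank of normal feature vectors."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # ASSUMPTION: oldest entries are evicted first (FIFO) once full.
        self._bank: Deque[List[float]] = deque(maxlen=capacity)

    def update(self, feature: Sequence[float]) -> None:
        """Store one feature vector extracted from a normal image."""
        self._bank.append(list(feature))

    def score(self, feature: Sequence[float]) -> float:
        """ASSUMPTION: anomaly score = distance to the nearest stored normal feature."""
        if not self._bank:
            raise ValueError("memory is empty")
        return min(cosine_distance(feature, m) for m in self._bank)

    def __len__(self) -> int:
        return len(self._bank)


# Toy 2-D features for illustration only.
memory = NormalFeatureMemory(capacity=2)
for normal_feature in ([1.0, 0.0], [0.9, 0.1], [0.8, 0.2]):
    memory.update(normal_feature)  # the first vector is evicted by the third

near_normal = memory.score([0.9, 0.1])  # matches a stored vector
far_from_normal = memory.score([0.0, 1.0])
print(len(memory), near_normal < far_from_normal)
```

A feature close to any stored normal vector gets a score near zero, while one pointing in a different direction gets a larger score. Swapping the eviction rule or the scoring function changes the behavior, which is why the referee asks for these choices to be stated with equations.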
Circularity Check
No significant circularity; empirical benchmark and method proposal
Full rationale
The paper introduces a new public benchmark for retinal anomaly detection and reports empirical comparisons of methods, finding that a fully supervised DRA approach performs best yet degrades on unseen anomalies, then proposes NFM-DRA by integrating a Normal Feature Memory inspired by prior one-class memory-bank work. No mathematical derivations, first-principles predictions, or equations appear in the abstract or description. Claims rest on experimental results on the released benchmark rather than any reduction of outputs to fitted inputs or self-citations by construction. The benchmark release and explicit contrast to prior saturated setups provide independent verifiability, making the work self-contained with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: labeled abnormal data and unlabeled data are commonly available in clinical retinal imaging practice.