Recognition: 2 theorem links · Lean Theorem
OphMAE: Bridging Volumetric and Planar Imaging with a Foundation Model for Adaptive Ophthalmological Diagnosis
Pith reviewed 2026-05-08 18:48 UTC · model grok-4.3
The pith
A foundation model fuses 3D OCT volumes with 2D en face images to reach state-of-the-art accuracy on 17 ophthalmic tasks while adapting to single-modality or low-data inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OphMAE, pre-trained on 183,875 paired OCT images from 32,765 patients, uses a cross-modal fusion architecture and an adaptive inference mechanism to integrate volumetric 3D OCT with planar 2D en face OCT. It achieves state-of-the-art performance across 17 diagnostic tasks on 48,340 images from 8,191 patients, including AUCs of 96.9% for AMD and 97.2% for DME, and it remains accurate under single-modality or low-data constraints.
What carries the argument
The novel cross-modal fusion architecture that merges features from 3D volumetric OCT and 2D planar en face OCT, combined with the adaptive inference mechanism that enables flexible use of available modalities.
If this is right
- The model surpasses prior single-modal and multimodal foundation models on the benchmark tasks.
- It sustains 93.7% AUC for AMD diagnosis using only 2D inputs.
- Performance holds at 95.7% AUC even with just 500 labeled samples for training.
- It offers a framework that works across different resource levels for ophthalmic diagnosis.
- The approach scales to handle diverse diagnostic tasks without requiring full 3D hardware.
Where Pith is reading between the lines
- Clinics lacking 3D OCT scanners could still benefit from the model's 2D fallback mode for reliable screening.
- Similar cross-modal pretraining strategies might apply to other medical fields combining 3D and 2D scans, like radiology.
- Further scaling the pretraining dataset could improve data efficiency even more for rare eye conditions.
- The adaptive mechanism suggests potential for models that dynamically choose the most informative modality per patient case.
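The adaptive-modality behaviour these readings describe can be sketched as a simple dispatch over whichever inputs are available. All names below (encode_3d, encode_2d, fuse, adaptive_features) are hypothetical toys for illustration, not OphMAE's actual API:

```python
# Hypothetical sketch of adaptive inference: use fused features when both
# modalities are present, otherwise fall back to the single available
# encoder. Names and transforms are illustrative stand-ins only.
from typing import List, Optional

def encode_3d(volume: List[float]) -> List[float]:
    # Stand-in for a 3D OCT volume encoder (toy transform).
    return [v * 2.0 for v in volume]

def encode_2d(image: List[float]) -> List[float]:
    # Stand-in for a 2D en face OCT encoder (toy transform).
    return [v + 1.0 for v in image]

def fuse(f3d: List[float], f2d: List[float]) -> List[float]:
    # Toy cross-modal fusion: element-wise average of both feature sets.
    return [(a + b) / 2.0 for a, b in zip(f3d, f2d)]

def adaptive_features(volume: Optional[List[float]],
                      image: Optional[List[float]]) -> List[float]:
    # Dispatch on available modalities; degrade gracefully to 2D-only.
    if volume is not None and image is not None:
        return fuse(encode_3d(volume), encode_2d(image))
    if volume is not None:
        return encode_3d(volume)
    if image is not None:
        return encode_2d(image)
    raise ValueError("at least one modality is required")
```

The design point is that the downstream head sees a feature vector of the same shape regardless of which branch fired, which is what makes a 2D-only fallback possible without retraining.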
Load-bearing premise
The cross-modal fusion and adaptive inference extract truly complementary information from the paired 3D and 2D images that generalizes to new data without relying on dataset-specific patterns or leakage.
What would settle it
External validation on patient populations unseen during pretraining: if the multimodal model's performance there falls to levels comparable to or below single-modality models, the claimed complementarity does not hold in general.
read the original abstract
The advent of foundation models has heralded a new era in medical artificial intelligence (AI), enabling the extraction of generalizable representations from large-scale unlabeled datasets. However, current ophthalmic AI paradigms are predominantly constrained to single-modality inference, thereby creating a dissonance with clinical practice where diagnosis relies on the synthesis of complementary imaging modalities. Furthermore, the deployment of high-performance AI in resource-limited settings is frequently impeded by the unavailability of advanced three-dimensional imaging hardware. Here, we present the Ophthalmic multimodal Masked Autoencoder (OphMAE), a multi-imaging foundation model engineered to synergize the volumetric depth of 3D Optical Coherence Tomography (OCT) with the planar context of 2D en face OCT. By implementing a novel cross-modal fusion architecture and a unique adaptive inference mechanism, OphMAE was pre-trained on a massive dataset with of 183,875 paired OCT images derived from 32,765 patients. In a rigorous benchmark encompassing 17 diverse diagnostic tasks with 48,340 paired OCT images from 8,191 patients, the model demonstrated state-of-the-art performance, achieving an Area Under the Curve (AUC) of 96.9% for Age-related Macular Degeneration (AMD) and 97.2% for Diabetic Macular Edema (DME), consistently surpassing existing single-modal and multimodal foundation models. Crucially, OphMAE exhibits robust engineering adaptability: it maintains high diagnostic accuracy, such as 93.7% AUC for AMD, even when restricted to single-modality 2D inputs, and demonstrates exceptional data efficiency by retaining 95.7% AUC with as few as 500 labeled samples. This work establishes a scalable and adaptable framework for ophthalmic AI, ensuring robust performance across different tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OphMAE, a multimodal masked autoencoder foundation model designed to fuse volumetric 3D OCT with planar 2D en face OCT for ophthalmic diagnosis. Pre-trained on 183,875 paired images from 32,765 patients, it reports SOTA performance across 17 diagnostic tasks on 48,340 paired images from 8,191 patients, with AUCs of 96.9% for AMD and 97.2% for DME, while claiming robustness to single-modality 2D inputs (93.7% AMD AUC) and data efficiency (95.7% AUC with 500 labels).
Significance. If the cross-modal fusion and adaptive inference genuinely extract complementary information that generalizes, the work could meaningfully advance adaptable ophthalmic AI, particularly for resource-limited settings lacking 3D hardware. The data-efficiency results and single-modality fallback are potentially valuable strengths if properly validated against leakage risks.
major comments (2)
- [Abstract and §4 (Benchmark)] The central SOTA claims (96.9% AMD AUC, 97.2% DME AUC) rest on the assumption that the pretraining cohort (32,765 patients) and benchmark cohort (8,191 patients) are strictly patient-disjoint. No details are provided on patient-ID handling or the splitting procedure for the paired images; in ophthalmic data, patient-specific anatomy or artifacts can leak across sets and inflate performance without the architecture contributing beyond single-modality baselines.
- [§4 (Results)] The reported AUC values and SOTA comparisons lack error bars, confidence intervals, number of runs, or statistical significance tests against the cited single-modal and multimodal baselines. This makes it impossible to determine whether the gains from cross-modal fusion are reliable or load-bearing for the generalization claims.
minor comments (3)
- [Abstract] Typo in 'pre-trained on a massive dataset with of 183,875 paired OCT images' (extra 'of').
- [Abstract] The 17 diagnostic tasks are not listed or characterized, preventing assessment of task diversity and clinical relevance.
- [Abstract] No reference to specific prior multimodal MAE works or ophthalmic foundation models used as baselines.
Simulated Author's Rebuttal
We sincerely thank the referee for their thorough and constructive review of our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our methods and results.
read point-by-point responses
Referee: Abstract and §4 (Benchmark): The central SOTA claims (96.9% AMD AUC, 97.2% DME AUC) rest on the assumption that the pretraining cohort (32,765 patients) and benchmark cohort (8,191 patients) are strictly patient-disjoint. No details are provided on patient-ID handling or splitting procedure for the paired images; in ophthalmic data, patient-specific anatomy or artifacts can leak across sets and inflate performance without the architecture contributing beyond single-modality baselines.
Authors: We appreciate the referee's emphasis on preventing data leakage in ophthalmic imaging studies. The pre-training and evaluation cohorts were constructed to be strictly patient-disjoint by performing all splits at the patient level using unique patient identifiers; no images from the same patient were allowed to appear in both sets. We will add a detailed subsection on patient-ID handling and the splitting protocol (including how paired 3D/2D images were managed) to the Data and Methods sections of the revised manuscript. revision: yes
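The patient-level splitting protocol the authors commit to documenting can be sketched generically: hash the patient ID (never the image) so that every image from one patient lands in the same partition. A minimal illustration, not the paper's actual protocol:

```python
# Generic sketch of a patient-disjoint train/test split, assuming each
# record carries a unique patient ID. Hashing the ID guarantees that all
# images (3D or 2D) from one patient fall on the same side of the split.
import hashlib

def patient_partition(patient_id: str, test_fraction: float = 0.2) -> str:
    # Deterministic hash -> [0, 1) bucket; same ID always maps to the
    # same partition, independent of how many images the patient has.
    digest = hashlib.sha256(patient_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "test" if bucket < test_fraction else "train"

def split_records(records):
    # records: iterable of (patient_id, image) pairs.
    train, test = [], []
    for pid, image in records:
        (test if patient_partition(pid) == "test" else train).append((pid, image))
    return train, test
```

Because the split is a pure function of the patient ID, the same partition is reproduced exactly when new images from an existing patient are added later, which is the property that rules out cross-set leakage.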
Referee: §4 (Results): The reported AUC values and SOTA comparisons lack error bars, confidence intervals, number of runs, or statistical significance tests against the cited single-modal and multimodal baselines. This makes it impossible to determine whether the gains from cross-modal fusion are reliable or load-bearing for the generalization claims.
Authors: We agree that reporting variability and statistical comparisons is essential for substantiating the performance gains. We have performed additional runs with five different random seeds and will update §4 to include error bars (standard deviation), 95% confidence intervals, and results of appropriate statistical tests (e.g., paired t-tests) comparing OphMAE against the single-modal and multimodal baselines. revision: yes
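The kind of uncertainty estimate the response commits to can be sketched as a bootstrap confidence interval around an AUC. Pure-Python AUC via the Mann-Whitney pairwise statistic; illustrative only, not the paper's analysis code:

```python
# Bootstrap 95% confidence interval for an AUC, the simplest form of the
# variability reporting promised in the revision. Illustrative sketch.
import random

def auc(scores_pos, scores_neg):
    # Probability that a random positive outscores a random negative,
    # counting ties as half (Mann-Whitney formulation of the AUC).
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def bootstrap_auc_ci(scores_pos, scores_neg, n_boot=1000, seed=0):
    # Resample positives and negatives with replacement, recompute the
    # AUC each time, and take the empirical 2.5% / 97.5% quantiles.
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        bp = [rng.choice(scores_pos) for _ in scores_pos]
        bn = [rng.choice(scores_neg) for _ in scores_neg]
        stats.append(auc(bp, bn))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]
```

For model-vs-baseline comparisons on the same test set, a paired test (e.g. DeLong or a permutation test on paired predictions) is the usual next step beyond per-model intervals.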
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical multimodal masked autoencoder pretrained on paired OCT data and evaluated via AUC on 17 diagnostic tasks. No derivation chain, equations, or predictions are presented that reduce by construction to fitted inputs, self-citations, or renamed ansatzes. Performance metrics are reported from benchmark splits rather than being algebraically forced by the model's own definitions. The central claims rest on external validation rather than internal self-definition.
Axiom & Free-Parameter Ledger
free parameters (2)
- Cross-modal fusion weights
- Adaptive inference thresholds
axioms (1)
- domain assumption: Masked autoencoder pretraining on large paired multimodal medical images produces representations that transfer to multiple downstream diagnostic tasks.
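The masked-autoencoder mechanic named in this axiom can be illustrated in a few lines: hide a large fraction of patch tokens and keep only the visible remainder for the encoder to see. Generic MAE masking, not OphMAE's code:

```python
# Toy illustration of MAE-style random masking: select a high mask ratio
# (75% is the common default), hide those patch tokens, and return the
# visible patches plus both index sets for later reconstruction loss.
import random

def mask_patches(patches, mask_ratio=0.75, seed=0):
    rng = random.Random(seed)
    n = len(patches)
    n_masked = int(n * mask_ratio)
    idx = list(range(n))
    rng.shuffle(idx)
    masked_idx = sorted(idx[:n_masked])     # targets for reconstruction
    visible_idx = sorted(idx[n_masked:])    # what the encoder actually sees
    visible = [patches[i] for i in visible_idx]
    return visible, visible_idx, masked_idx
```

The pretraining signal comes from reconstructing the masked patches from the visible ones, which is what forces the encoder to learn structure rather than copy pixels.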
Lean theorems connected to this paper
- Cost.FunctionalEquation (J(x) = ½(x + x⁻¹) − 1), washburn_uniqueness_aczel: the paper's losses are MSE-based ML objectives with tunable λ's, not ratio-symmetric J-costs; no structural contact. Tagged unclear.
Relation between the paper passage and the cited Recognition theorem:
The overall optimization objective is defined as L_total = λ1 L_recon + λ2 L_cross_relation + λ3 L_consistency
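The quoted objective is a plain weighted sum of three loss terms. A minimal sketch; the λ weights below are placeholders, since the quoted passage does not report the paper's values:

```python
# Sketch of the quoted composite objective
#   L_total = λ1·L_recon + λ2·L_cross_relation + λ3·L_consistency
# as a weighted sum. The lambda values are illustrative placeholders.
def total_loss(l_recon, l_cross_relation, l_consistency,
               lambdas=(1.0, 0.5, 0.5)):
    l1, l2, l3 = lambdas
    return l1 * l_recon + l2 * l_cross_relation + l3 * l_consistency
```

This freely tunable weighting is exactly why the entry above finds no structural contact with the fixed, ratio-symmetric cost J(x) = ½(x + x⁻¹) − 1.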
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Burton, M. J. et al. The Lancet Global Health Commission on global eye health: vision beyond 2020. The Lancet Global Health 9, e489–e551 (2021)
- [2] Bourne, R. et al. Trends in prevalence of blindness and distance and near vision impairment over 30 years: an analysis for the Global Burden of Disease Study. The Lancet Global Health 9, e130–e143 (2021)
- [3] Flaxman, S. R. et al. Global causes of blindness and distance vision impairment 1990–2020: a systematic review and meta-analysis. The Lancet Global Health 5, e1221–e1234 (2017)
- [4] Vemulakonda, G. A. et al. Age-related macular degeneration preferred practice pattern®. Ophthalmology 132, P1–P74 (2025)
- [5] Lim, J. I. et al. Diabetic retinopathy preferred practice pattern®. Ophthalmology 132, P75–P162 (2025)
- [6] Gedde, S. J. et al. Primary open-angle glaucoma preferred practice pattern®. Ophthalmology 128, P71–P150 (2021)
- [7] Sadda, S. R. et al. Consensus definition for atrophy associated with age-related macular degeneration on OCT: Classification of Atrophy report 3. Ophthalmology 125, 537–548 (2018)
- [8] Pandya, B. U., Grinton, M., Mandelcorn, E. D. & Felfeli, T. Retinal optical coherence tomography imaging biomarkers: a review of the literature. Retina 44, 369–380 (2024)
- [9] Ahn, S. J. Retinal thickness analysis using optical coherence tomography: diagnostic and monitoring applications in retinal diseases. Diagnostics 15, 833 (2025)
- [10] Garvin, M. K. et al. Automated 3-D intraretinal layer segmentation of macular spectral-domain optical coherence tomography images. IEEE Transactions on Medical Imaging 28, 1436–1447 (2009)
- [12] Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health 1, e271–e297 (2019)
- [13] McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020)
- [14] Ting, D. S. W. et al. Artificial intelligence and deep learning in ophthalmology. British Journal of Ophthalmology 103, 167–175 (2019)
- [15] Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digital Medicine 1, 39 (2018)
- [16] Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016)
- [17] De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24, 1342–1350 (2018)
- [18] He, K. et al. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009 (2022)
- [19] Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023)
- [20] Qiu, J. et al. Development and validation of a multimodal multitask vision foundation model for generalist ophthalmic artificial intelligence. NEJM AI 1, AIoa2300221 (2024)
- [21]
- [22] Schmidt-Erfurth, U. et al. Guidelines for the management of neovascular age-related macular degeneration by the European Society of Retina Specialists (EURETINA). British Journal of Ophthalmology 98, 1144–1167 (2014)
- [23] Spaide, R. F., Klancnik, J. M. & Cooney, M. J. Retinal vascular layers imaged by fluorescein angiography and optical coherence tomography angiography. JAMA Ophthalmology 133, 45–50 (2015)
- [24] Freund, K. B., Sarraf, D., Mieler, W. F. & Yannuzzi, L. A. The Retinal Atlas (2nd ed.) (2016)
- [25] Feo, A. et al. En face OCT: breakthroughs in understanding the pathoanatomy of retinal disease and clinical applications. Progress in Retinal and Eye Research 106, 101351 (2025)
- [26] Liu, J. et al. Multimodal imaging and en face OCT detection of calcified drusen in eyes with age-related macular degeneration. Ophthalmology Science 2, 100162 (2022). URL https://www.sciencedirect.com/science/article/pii/S2666914522000513
- [27] Feo, A. & Sarraf, D. En face optical coherence tomography and OCT angiography in the pathoanatomy of inflammatory macular disease. American Journal of Ophthalmology 284, 110–122 (2026). URL https://www.sciencedirect.com/science/article/pii/S0002939425006890
- [28] Laiginhas, R. et al. Multimodal imaging, OCT B-scan localization, and en face OCT detection of macular hyperpigmentation in eyes with intermediate age-related macular degeneration. Ophthalmology Science 2, 100116 (2022). URL https://www.sciencedirect.com/science/article/pii/S2666914522000057
- [29] Chhablani, J. & Cohen, F.; Central Serous Chorioretinopathy International Group. Multimodal imaging-based central serous chorioretinopathy classification. Ophthalmology Retina 4, 1043–1046 (2020)
- [30] Tsuboi, K., Fukushima, M. & Akai, R. How optical coherence tomography has changed the management of macular holes: a narrative review. Taiwan Journal of Ophthalmology 15, 344–353 (2025)
- [31] Bailey, S. T. et al. Idiopathic epiretinal membrane and vitreomacular traction preferred practice pattern®. Ophthalmology 132, P197–P233 (2025)
- [32] Chen, J. C. & Lee, L. R. Clinical spectrum of lamellar macular defects including pseudoholes and pseudocysts defined by optical coherence tomography. British Journal of Ophthalmology 92, 1342–1346 (2008)
- [33] You, L. et al. The impact of aging on ocular diseases: unveiling complex interactions. Aging and Disease 16, 2803 (2024)
- [34] Ramovecchi, P., Salati, C. & Zeppieri, M. Spontaneous posterior vitreous detachment: a glance at the current literature. World Journal of Experimental Medicine 11, 30 (2021)
- [35] Tillmann, A., Ceklic, L., Dysli, C. & Munk, M. R. Gender differences in retinal diseases: a review. Clinical & Experimental Ophthalmology 52, 317–333 (2024)
discussion (0)