pith. machine review for the scientific record.

arxiv: 2605.02714 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

OphMAE: Bridging Volumetric and Planar Imaging with a Foundation Model for Adaptive Ophthalmological Diagnosis

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:48 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: ophthalmic foundation model · cross-modal fusion · OCT imaging · masked autoencoder · multimodal diagnosis · AMD · DME · adaptive inference

The pith

A foundation model fuses 3D OCT volumes with 2D en face images to reach state-of-the-art accuracy on 17 ophthalmic tasks while adapting to single-modality or low-data inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OphMAE to close the gap between single-modality AI systems and real clinical practice that combines multiple eye imaging types for diagnosis. The model is pretrained on a large collection of paired 3D and 2D OCT scans using cross-modal fusion and an adaptive inference method that lets it draw on volumetric depth plus planar context. On a benchmark of 17 diagnostic tasks involving tens of thousands of images from thousands of patients, it outperforms earlier single-modal and multimodal models, with AUCs of 96.9 percent for AMD and 97.2 percent for DME. It also keeps high accuracy when limited to 2D inputs alone or when fine-tuned on only a few hundred labeled examples. The result is a framework meant to support reliable ophthalmic AI even in settings that lack advanced 3D hardware.

Core claim

Pre-trained on 183,875 paired OCT images from 32,765 patients, OphMAE uses a cross-modal fusion architecture and an adaptive inference mechanism to integrate volumetric 3D OCT with planar 2D en face OCT. The result is state-of-the-art performance across 17 diagnostic tasks on 48,340 images from 8,191 patients, including AUCs of 96.9% for AMD and 97.2% for DME, with strong performance maintained under single-modality or low-data constraints.

What carries the argument

The novel cross-modal fusion architecture that merges features from 3D volumetric OCT and 2D planar en face OCT, combined with the adaptive inference mechanism that enables flexible use of available modalities.
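
The page does not reproduce the fusion module itself. As a point of reference, here is a minimal sketch of one plausible shape for a bidirectional fusion block in which each modality's tokens attend to the other's; the class name, dimensions, and the choice of cross-attention are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class BidirectionalFusionBlock(nn.Module):
    """Hypothetical fusion block: each modality cross-attends to the other."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn_3d_from_2d = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_2d_from_3d = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_3d = nn.LayerNorm(dim)
        self.norm_2d = nn.LayerNorm(dim)
        self.mlp_3d = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_2d = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens_3d, tokens_2d):
        # tokens_3d: (B, N3, dim) patch tokens from the volumetric OCT encoder
        # tokens_2d: (B, N2, dim) patch tokens from the en face OCT encoder
        ctx_3d, _ = self.attn_3d_from_2d(self.norm_3d(tokens_3d), tokens_2d, tokens_2d)
        ctx_2d, _ = self.attn_2d_from_3d(self.norm_2d(tokens_2d), tokens_3d, tokens_3d)
        tokens_3d = tokens_3d + ctx_3d  # residual: volume enriched by planar context
        tokens_2d = tokens_2d + ctx_2d  # residual: plane enriched by volumetric depth
        tokens_3d = tokens_3d + self.mlp_3d(tokens_3d)
        tokens_2d = tokens_2d + self.mlp_2d(tokens_2d)
        return tokens_3d, tokens_2d
```

Whatever the actual design, this is the component the claims lean on: if the cross-attended features carry no information beyond each encoder alone, the multimodal gains should vanish.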

If this is right

  • The model surpasses prior single-modal and multimodal foundation models on the benchmark tasks.
  • It sustains 93.7% AUC for AMD diagnosis using only 2D inputs.
  • Performance holds at 95.7% AUC even with just 500 labeled samples for training.
  • It offers a framework that works across different resource levels for ophthalmic diagnosis.
  • The approach scales to diverse diagnostic tasks without requiring full 3D hardware; one possible form of the 2D-only fallback is sketched below.
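
The adaptive-inference mechanism is not detailed on this page. One common way to keep a fusion model usable when the 3D volume is absent is to substitute learned placeholder tokens for the missing modality; the sketch below illustrates that pattern under that assumption, with all names and dimensions hypothetical.

```python
import torch
import torch.nn as nn

class AdaptiveInferenceModel(nn.Module):
    """Hypothetical wrapper: falls back to learned placeholder tokens when
    the 3D volume is unavailable, so the fusion path stays intact."""

    def __init__(self, encoder_3d, encoder_2d, fusion, dim=768,
                 n_3d_tokens=512, num_classes=2):
        super().__init__()
        self.encoder_3d = encoder_3d
        self.encoder_2d = encoder_2d
        self.fusion = fusion
        # Learned stand-ins for the missing volumetric modality.
        self.missing_3d = nn.Parameter(torch.zeros(1, n_3d_tokens, dim))
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, enface_2d, volume_3d=None):
        tokens_2d = self.encoder_2d(enface_2d)
        if volume_3d is not None:
            tokens_3d = self.encoder_3d(volume_3d)
        else:
            # 2D-only clinics: substitute the learned placeholder tokens.
            tokens_3d = self.missing_3d.expand(tokens_2d.size(0), -1, -1)
        tokens_3d, tokens_2d = self.fusion(tokens_3d, tokens_2d)
        pooled = torch.cat([tokens_3d.mean(dim=1), tokens_2d.mean(dim=1)], dim=-1)
        return self.classifier(pooled)
```

Under this reading, the reported 93.7% 2D-only AUC for AMD would measure how much diagnostic signal survives when the volumetric branch runs on placeholders alone.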

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Clinics lacking 3D OCT scanners could still benefit from the model's 2D fallback mode for reliable screening.
  • Similar cross-modal pretraining strategies might apply to other medical fields combining 3D and 2D scans, like radiology.
  • Further scaling the pretraining dataset could improve data efficiency even more for rare eye conditions.
  • The adaptive mechanism suggests potential for models that dynamically choose the most informative modality per patient case.

Load-bearing premise

The cross-modal fusion and adaptive inference extract truly complementary information from the paired 3D and 2D images that generalizes to new data without relying on dataset-specific patterns or leakage.

What would settle it

External validation on patient populations unseen during pretraining and benchmarking would settle it: if the multimodal advantage persists, the complementarity claim stands; if performance falls to levels comparable to or below single-modality models, the claimed complementarity does not hold in general.

Figures

Figures reproduced from arXiv: 2605.02714 by Amir Reza Hajrasouliha, Andrew J. Saykin, Jiang Bian, Jie Xu, Jinyu Ding, Qingyu Chen, Renjie Liang, Ruogu Fang, Sunu Mathew, Tienyu Chang, Yu Huang, Zhen Chen.

Figure 1. Overview of the proposed OphMAE framework. 1) Large-scale paired ophthalmic imaging data were curated from the University of Florida (UF) Health clinical data repository, comprising paired 3D OCT volumes and 2D en face OCT images. The dataset was divided into 183,875 paired images from 32,765 patients for self-supervised pretraining and 48,340 paired images from 8,191 patients for downstream benchmarking. …
Figure 2. Performance comparison of OphMAE and baseline models across evaluated tasks. The figure summarizes the AUROC (a) and F1 scores (b) of OphMAE and the baseline models on the downstream classification tasks. AUROC was used to assess overall discriminative performance, whereas F1 score was used to evaluate the balance between precision and recall. The results show the relative predictive performance of the pr…
Figure 3. AUROC comparison of proposed state-of-the-art and OphMAE models under different input modality settings. a) The figure demonstrates the AUROC improvement of multi-modality in state-of-the-art foundation models across the downstream tasks. b) The figure summarizes the AUROC achieved by the proposed OphMAE model across the downstream tasks when using different imaging modalities as input. We calculate the perfor…
Figure 4. AUROC of OphMAE and baseline models across different training set sizes. The figure summarizes the downstream classification performance of OphMAE and the baseline models under subset fine-tuning settings. Test AUROC is shown for each model when the number of training samples per class was progressively reduced, enabling comparison of performance robustness under limited-data conditions. Error bars indic…
Figure 5. FPR-based subgroup fairness comparison between OphMAE and RETFound. The figure summarizes the false-positive rate (FPR) ratios between protected and privileged groups across age, sex, and race/ethnicity subgroups for OphMAE and RETFound. A ratio closer to 1.0 indicates better subgroup parity, whereas larger deviations from 1.0 indicate greater disparity in false-positive performance.
Figure 6. AUROC performance of OphMAE and comparator foundation models on public single-modality benchmarks. a, Public fundus datasets. b, Public 3D OCT datasets. For each benchmark, the corresponding modality-specific model was fine-tuned and evaluated. AUROC is reported to assess discriminative performance across datasets. To further investigate the effectiveness of the proposed OphMAE modality-specific pretraine…
Figure 7. Architecture and self-supervised pre-training strategy of OphMAE. The proposed OphMAE framework consists of modality-specific encoders for 3D volumetric OCT and 2D en face OCT, a bidirectional cross-modal fusion module, and lightweight modality-specific decoders used only during pre-training. The model is optimized with three objectives: modality-specific reconstruction loss, multi-space semantic similarit…
Figure 8. Data curation and cohort construction pipeline for retinal OCT. The pipeline summarizes the integration of raw retinal imaging archives and image metadata, followed by study-level matching, quality filtering, modality selection, patient-level dataset splitting, and downstream label assignment. As shown in …
Figure 9. Qualitative attribution maps of OphMAE for representative retinal diseases (set 1). Representative heatmaps generated from the fine-tuned OphMAE model for AMD, CRVO, CSR, DME, DR, and drusen. Attribution maps are shown for both the 3D OCT volume and the paired 2D en face OCT image, with warmer colors indicating regions that contributed more strongly to the model prediction. For the volumetric branch, heat…
Figure 10. Qualitative attribution maps of OphMAE for representative retinal diseases (set 2). Representative heatmaps generated from the fine-tuned OphMAE model for AMD, CRVO, CSR, DME, DR, and drusen. Attribution maps are shown for both the 3D OCT volume and the paired 2D en face OCT image, with warmer colors indicating regions that contributed more strongly to the model prediction. For the volumetric branch, hea…
read the original abstract

The advent of foundation models has heralded a new era in medical artificial intelligence (AI), enabling the extraction of generalizable representations from large-scale unlabeled datasets. However, current ophthalmic AI paradigms are predominantly constrained to single-modality inference, thereby creating a dissonance with clinical practice where diagnosis relies on the synthesis of complementary imaging modalities. Furthermore, the deployment of high-performance AI in resource-limited settings is frequently impeded by the unavailability of advanced three-dimensional imaging hardware. Here, we present the Ophthalmic multimodal Masked Autoencoder (OphMAE), a multi-imaging foundation model engineered to synergize the volumetric depth of 3D Optical Coherence Tomography (OCT) with the planar context of 2D en face OCT. By implementing a novel cross-modal fusion architecture and a unique adaptive inference mechanism, OphMAE was pre-trained on a massive dataset with of 183,875 paired OCT images derived from 32,765 patients. In a rigorous benchmark encompassing 17 diverse diagnostic tasks with 48,340 paired OCT images from 8,191 patients, the model demonstrated state-of-the-art performance, achieving an Area Under the Curve (AUC) of 96.9% for Age-related Macular Degeneration (AMD) and 97.2% for Diabetic Macular Edema (DME), consistently surpassing existing single-modal and multimodal foundation models. Crucially, OphMAE exhibits robust engineering adaptability: it maintains high diagnostic accuracy, such as 93.7% AUC for AMD, even when restricted to single-modality 2D inputs, and demonstrates exceptional data efficiency by retaining 95.7% AUC with as few as 500 labeled samples. This work establishes a scalable and adaptable framework for ophthalmic AI, ensuring robust performance across different tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces OphMAE, a multimodal masked autoencoder foundation model designed to fuse volumetric 3D OCT with planar 2D en face OCT for ophthalmic diagnosis. Pre-trained on 183,875 paired images from 32,765 patients, it reports SOTA performance across 17 diagnostic tasks on 48,340 paired images from 8,191 patients, with AUCs of 96.9% for AMD and 97.2% for DME, while claiming robustness to single-modality 2D inputs (93.7% AMD AUC) and data efficiency (95.7% AUC with 500 labels).

Significance. If the cross-modal fusion and adaptive inference genuinely extract complementary information that generalizes, the work could meaningfully advance adaptable ophthalmic AI, particularly for resource-limited settings lacking 3D hardware. The data-efficiency results and single-modality fallback are potentially valuable strengths if properly validated against leakage risks.

major comments (2)
  1. [Abstract and §4 (Benchmark)] The central SOTA claims (96.9% AMD AUC, 97.2% DME AUC) rest on the assumption that the pretraining cohort (32,765 patients) and benchmark cohort (8,191 patients) are strictly patient-disjoint. No details are provided on patient-ID handling or the splitting procedure for the paired images; in ophthalmic data, patient-specific anatomy or artifacts can leak across sets and inflate performance without the architecture contributing beyond single-modality baselines.
  2. [§4 (Results)] The reported AUC values and SOTA comparisons lack error bars, confidence intervals, number of runs, or statistical significance tests against the cited single-modal and multimodal baselines. This makes it impossible to determine whether the gains from cross-modal fusion are reliable or load-bearing for the generalization claims.
minor comments (3)
  1. [Abstract] Typo in 'pre-trained on a massive dataset with of 183,875 paired OCT images' (extra 'of').
  2. [Abstract] The 17 diagnostic tasks are not listed or characterized, preventing assessment of task diversity and clinical relevance.
  3. [Abstract] No reference to specific prior multimodal MAE works or ophthalmic foundation models used as baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their thorough and constructive review of our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our methods and results.

read point-by-point responses
  1. Referee: Abstract and §4 (Benchmark): The central SOTA claims (96.9% AMD AUC, 97.2% DME AUC) rest on the assumption that the pretraining cohort (32,765 patients) and benchmark cohort (8,191 patients) are strictly patient-disjoint. No details are provided on patient-ID handling or splitting procedure for the paired images; in ophthalmic data, patient-specific anatomy or artifacts can leak across sets and inflate performance without the architecture contributing beyond single-modality baselines.

    Authors: We appreciate the referee's emphasis on preventing data leakage in ophthalmic imaging studies. The pre-training and evaluation cohorts were constructed to be strictly patient-disjoint by performing all splits at the patient level using unique patient identifiers; no images from the same patient were allowed to appear in both sets. We will add a detailed subsection on patient-ID handling and the splitting protocol (including how paired 3D/2D images were managed) to the Data and Methods sections of the revised manuscript. revision: yes

  2. Referee: §4 (Results): The reported AUC values and SOTA comparisons lack error bars, confidence intervals, number of runs, or statistical significance tests against the cited single-modal and multimodal baselines. This makes it impossible to determine whether the gains from cross-modal fusion are reliable or load-bearing for the generalization claims.

    Authors: We agree that reporting variability and statistical comparisons is essential for substantiating the performance gains. We have performed additional runs with five different random seeds and will update §4 to include error bars (standard deviation), 95% confidence intervals, and results of appropriate statistical tests (e.g., paired t-tests) comparing OphMAE against the single-modal and multimodal baselines. revision: yes
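
To make the two commitments above concrete, the sketch below shows one standard way to enforce a patient-disjoint split and to report a patient-level bootstrap 95% confidence interval for AUROC. It uses scikit-learn's GroupShuffleSplit and roc_auc_score; the data-handling details are hypothetical placeholders, not the authors' actual protocol.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import roc_auc_score

def patient_disjoint_split(image_ids, patient_ids, test_fraction=0.2, seed=0):
    """Split image indices so that no patient appears in both sets."""
    patient_ids = np.asarray(patient_ids)
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_fraction, random_state=seed)
    train_idx, test_idx = next(splitter.split(image_ids, groups=patient_ids))
    # Hard check against the leakage failure mode flagged by the referee.
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
    return train_idx, test_idx

def bootstrap_auc_ci(y_true, y_score, patient_ids, n_boot=1000, seed=0):
    """Resample patients (not images) so within-patient correlation is respected."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    patient_ids = np.asarray(patient_ids)
    rng = np.random.default_rng(seed)
    patients = np.unique(patient_ids)
    aucs = []
    for _ in range(n_boot):
        sampled = rng.choice(patients, size=len(patients), replace=True)
        idx = np.concatenate([np.flatnonzero(patient_ids == p) for p in sampled])
        if len(np.unique(y_true[idx])) < 2:
            continue  # a resample containing one class has no defined AUROC
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, [2.5, 97.5])
```

Resampling at the patient level matters because the benchmark contains multiple paired images per patient; image-level bootstraps would understate the interval width.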

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical multimodal masked autoencoder pretrained on paired OCT data and evaluated via AUC on 17 diagnostic tasks. No derivation chain, equations, or predictions are presented that reduce by construction to fitted inputs, self-citations, or renamed ansatzes. Performance metrics are reported from benchmark splits rather than being algebraically forced by the model's own definitions. The central claims rest on external validation rather than internal self-definition.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on empirical success of standard masked autoencoder pretraining plus two new architectural components whose effectiveness is demonstrated only through training on the described dataset; no new physical or mathematical entities are introduced.

free parameters (2)
  • Cross-modal fusion weights
    Learned parameters that integrate 3D volumetric and 2D planar features during pretraining.
  • Adaptive inference thresholds
    Hyperparameters controlling how the model switches or combines modalities at test time.
axioms (1)
  • domain assumption: Masked autoencoder pretraining on large paired multimodal medical images produces representations that transfer to multiple downstream diagnostic tasks.
    Invoked implicitly when claiming generalizability from pretraining to the 17 benchmark tasks.
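
Since the ledger's single axiom is that masked-autoencoder pretraining transfers, it is worth seeing how little machinery the underlying objective involves. The sketch below gives a minimal masked-reconstruction loss in the style of He et al. [18]; the 75% mask ratio and mean-squared-error-on-masked-patches formulation are conventional MAE choices, not confirmed details of OphMAE's loss.

```python
import torch

def masked_reconstruction_loss(patches, predictions, mask_ratio=0.75):
    """MSE on a random subset of masked patches, MAE-style.

    patches, predictions: (B, N, D) flattened ground-truth patches and their
    decoder reconstructions. In a full MAE the same random mask also decides
    which patches the encoder never sees; only the loss side is shown here.
    """
    B, N, _ = patches.shape
    num_masked = int(mask_ratio * N)
    noise = torch.rand(B, N, device=patches.device)
    masked_idx = noise.argsort(dim=1)[:, :num_masked]  # per-sample random subset
    mask = torch.zeros(B, N, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, masked_idx, True)
    per_patch = ((predictions - patches) ** 2).mean(dim=-1)  # MSE per patch
    return per_patch[mask].mean()
```

Everything beyond this objective, including whether the learned representations transfer to the 17 benchmark tasks, is the empirical content the axiom asserts.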

pith-pipeline@v0.9.0 · 5668 in / 1414 out tokens · 101248 ms · 2026-05-08T18:48:16.209801+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

34 extracted references · 1 canonical work page

  1. [1] Burton, M. J. et al. The Lancet Global Health Commission on global eye health: vision beyond 2020. The Lancet Global Health 9, e489–e551 (2021)
  2. [2] Bourne, R. et al. Trends in prevalence of blindness and distance and near vision impairment over 30 years: an analysis for the Global Burden of Disease Study. The Lancet Global Health 9, e130–e143 (2021)
  3. [3] Flaxman, S. R. et al. Global causes of blindness and distance vision impairment 1990–2020: a systematic review and meta-analysis. The Lancet Global Health 5, e1221–e1234 (2017)
  4. [4] Vemulakonda, G. A. et al. Age-related macular degeneration Preferred Practice Pattern®. Ophthalmology 132, P1–P74 (2025)
  5. [5] Lim, J. I. et al. Diabetic retinopathy Preferred Practice Pattern®. Ophthalmology 132, P75–P162 (2025)
  6. [6] Gedde, S. J. et al. Primary open-angle glaucoma Preferred Practice Pattern®. Ophthalmology 128, P71–P150 (2021)
  7. [7] Sadda, S. R. et al. Consensus definition for atrophy associated with age-related macular degeneration on OCT: Classification of Atrophy report 3. Ophthalmology 125, 537–548 (2018)
  8. [8] Pandya, B. U., Grinton, M., Mandelcorn, E. D. & Felfeli, T. Retinal optical coherence tomography imaging biomarkers: a review of the literature. Retina 44, 369–380 (2024)
  9. [9] Ahn, S. J. Retinal thickness analysis using optical coherence tomography: diagnostic and monitoring applications in retinal diseases. Diagnostics 15, 833 (2025)
  10. [10] Garvin, M. K. et al. Automated 3-D intraretinal layer segmentation of macular spectral-domain optical coherence tomography images. IEEE Transactions on Medical Imaging 28, 1436–1447 (2009)
  11. [12] Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health 1, e271–e297 (2019)
  12. [13] McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020)
  13. [14] Ting, D. S. W. et al. Artificial intelligence and deep learning in ophthalmology. British Journal of Ophthalmology 103, 167–175 (2019)
  14. [15] Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digital Medicine 1, 39 (2018)
  15. [16] Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016)
  16. [17] De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24, 1342–1350 (2018)
  17. [18] He, K. et al. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009 (2022)
  18. [19] Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023)
  19. [20] Qiu, J. et al. Development and validation of a multimodal multitask vision foundation model for generalist ophthalmic artificial intelligence. NEJM AI 1, AIoa2300221 (2024)
  20. [21] Liu, Z. et al. OCTCube-M: A 3D multimodal optical coherence tomography foundation model for retinal and systemic diseases with cross-cohort and cross-device validation. arXiv preprint arXiv:2408.11227 (2024)
  21. [22] Schmidt-Erfurth, U. et al. Guidelines for the management of neovascular age-related macular degeneration by the European Society of Retina Specialists (EURETINA). British Journal of Ophthalmology 98, 1144–1167 (2014)
  22. [23] Spaide, R. F., Klancnik, J. M. & Cooney, M. J. Retinal vascular layers imaged by fluorescein angiography and optical coherence tomography angiography. JAMA Ophthalmology 133, 45–50 (2015)
  23. [24] Freund, K. B., Sarraf, D., Mieler, W. F. & Yannuzzi, L. A. The Retinal Atlas (2nd ed.) (2016)
  24. [25] Feo, A. et al. En face OCT: breakthroughs in understanding the pathoanatomy of retinal disease and clinical applications. Progress in Retinal and Eye Research 106, 101351 (2025)
  25. [26] Liu, J. et al. Multimodal imaging and en face OCT detection of calcified drusen in eyes with age-related macular degeneration. Ophthalmology Science 2, 100162 (2022). URL https://www.sciencedirect.com/science/article/pii/S2666914522000513
  26. [27] Feo, A. & Sarraf, D. En face optical coherence tomography and OCT angiography in the pathoanatomy of inflammatory macular disease. American Journal of Ophthalmology 284, 110–122 (2026). URL https://www.sciencedirect.com/science/article/pii/S0002939425006890
  27. [28] Laiginhas, R. et al. Multimodal imaging, OCT B-scan localization, and en face OCT detection of macular hyperpigmentation in eyes with intermediate age-related macular degeneration. Ophthalmology Science 2, 100116 (2022). URL https://www.sciencedirect.com/science/article/pii/S2666914522000057
  28. [29] Chhablani, J. & Cohen, F.; Central Serous Chorioretinopathy International Group. Multimodal imaging-based central serous chorioretinopathy classification. Ophthalmology Retina 4, 1043–1046 (2020)
  29. [30] Tsuboi, K., Fukushima, M. & Akai, R. How optical coherence tomography has changed the management of macular holes: A narrative review. Taiwan Journal of Ophthalmology 15, 344–353 (2025)
  30. [31] Bailey, S. T. et al. Idiopathic epiretinal membrane and vitreomacular traction Preferred Practice Pattern®. Ophthalmology 132, P197–P233 (2025)
  31. [32] Chen, J. C. & Lee, L. R. Clinical spectrum of lamellar macular defects including pseudoholes and pseudocysts defined by optical coherence tomography. British Journal of Ophthalmology 92, 1342–1346 (2008)
  32. [33] You, L. et al. The impact of aging on ocular diseases: unveiling complex interactions. Aging and Disease 16, 2803 (2024)
  33. [34] Ramovecchi, P., Salati, C. & Zeppieri, M. Spontaneous posterior vitreous detachment: A glance at the current literature. World Journal of Experimental Medicine 11, 30 (2021)
  34. [35] Tillmann, A., Ceklic, L., Dysli, C. & Munk, M. R. Gender differences in retinal diseases: A review. Clinical & Experimental Ophthalmology 52, 317–333 (2024)