Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression
Pith reviewed 2026-06-29 18:03 UTC · model grok-4.3
The pith
Multimodal fusion for mean deviation prediction performs worse than single modalities until a rebalancing fix is applied.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that multimodal regression for mean deviation fails because a coupled imbalance between data distribution and modality learning conflict distorts the optimization landscape and produces unstable training; Re-M3Dr corrects this by strengthening unimodal representations with adaptive-margin supervised contrastive learning and then applying sharpness-aware gradient modulation to the joint objective, delivering an average 29 percent MSE reduction versus state-of-the-art multimodal baselines on both public and private clinical datasets.
What carries the argument
The Re-M3Dr framework, which first applies adaptive-margin supervised contrastive learning to each modality separately and then uses sharpness-aware gradient modulation during joint multimodal optimization.
If this is right
- Correcting the identified imbalance allows multimodal models to outperform both unimodal and prior multimodal approaches for mean deviation regression.
- Adaptive-margin contrastive learning improves the quality of individual modality features before fusion.
- Sharpness-aware gradient modulation reduces training instability in multimodal regression settings.
- The performance gain holds across both public benchmark and private clinical collections.
Where Pith is reading between the lines
- Similar distribution-versus-conflict imbalances may appear in other medical regression tasks that fuse complementary but unequally sampled image types.
- The same two-stage rebalancing could be tested on non-ophthalmic multimodal regression problems where one modality dominates the data distribution.
- Ablation experiments could isolate whether the contrastive pre-step or the modulation step contributes more to the observed MSE drop.
Load-bearing premise
The root cause is correctly identified as the coupled imbalance between data distribution and modality learning conflict, and the two proposed components fix it without introducing new instabilities or requiring dataset-specific tuning.
What would settle it
An experiment on the same public and private datasets in which Re-M3Dr still yields higher MSE than the best unimodal baseline after the contrastive and modulation steps are applied.
Figures
read the original abstract
Mean Deviation (MD) is a critical metric for assessing visual field loss in ophthalmology. While previous work has focused solely on predicting MD from Optical Coherence Tomography (OCT), it is intuitive to assume that combining OCT with another imaging of fundus photography (FP) could improve performance, as two ophthalmic medical imaging provide complementary information. This is particularly expected when sophisticated multi-objective optimization is applied, as documented in common multimodal classification. Surprisingly, our investigations reveal that multimodal fusion in this medical imaging scenario performs worse than unimodal model. Through detailed analysis, we identify the root cause as a coupled imbalance between data distribution and modality learning conflict. This imbalance distorts the optimization landscape, leading to unstable training. To address this challenge, we propose the method of Rebalanced MultiModal Mean Deviation Regression (Re-M3Dr), a novel multimodal regression framework. We enhance unimodal representation through adaptive margin based supervised contrastive learning. Then, our framework stabilizes the joint optimization with the sharpness-aware gradient modulation. Experimental results on both public and private clinical datasets show average 29\% reduction in MSE compared to SOTA multimodal learning methods, demonstrating the superiority of Re-M3Dr. The code is available in the supplementary materials.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multimodal fusion of OCT and fundus photography for Mean Deviation (MD) regression underperforms unimodal baselines due to a coupled imbalance between data distribution and modality learning conflict that distorts the optimization landscape. It proposes Re-M3Dr, which applies adaptive-margin supervised contrastive learning to enhance unimodal representations followed by sharpness-aware gradient modulation to stabilize joint training, and reports an average 29% MSE reduction versus SOTA multimodal methods on both public and private clinical datasets.
Significance. If the performance gain is shown to arise specifically from the proposed rebalancing rather than generic regularization, the work would offer a practical framework for multimodal regression in ophthalmology and similar medical imaging domains where modality imbalance is common. Reproducibility is aided by the stated code availability.
major comments (3)
- [Abstract, §3] Abstract and §3: The diagnosis that multimodal underperformance stems specifically from 'coupled imbalance between data distribution and modality learning conflict' is presented as the result of 'detailed analysis,' yet no quantitative diagnostics (gradient conflict metrics, per-modality loss curves, or optimization landscape visualizations) are referenced to link the observed MSE gap to this mechanism rather than alternatives such as fusion architecture or hyperparameter effects.
- [§4] §4 (Experiments): The central 29% average MSE reduction claim is load-bearing, but the abstract and reported results provide no ablation studies isolating the contribution of adaptive-margin contrastive learning versus sharpness-aware modulation, no error bars or statistical significance tests, and no dataset statistics (class imbalance ratios, modality-specific sample counts) that would allow verification of the imbalance diagnosis.
- [§4] §4: Reliance on a private clinical dataset without disclosed controls for data leakage, patient-level splitting, or external validation raises concerns about whether the reported gains generalize or are reproducible by the community.
minor comments (2)
- [§3] Notation for the adaptive margin and sharpness-aware terms should be defined explicitly with equations in §3 to allow direct comparison with prior contrastive and SAM literature.
- [Abstract] The abstract states 'the code is available in the supplementary materials' but does not specify the exact repository or commit; this should be clarified for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on clarifying the root-cause analysis, strengthening the experimental section, and addressing reproducibility. We respond to each major comment below and indicate planned revisions.
read point-by-point responses
-
Referee: [Abstract, §3] Abstract and §3: The diagnosis that multimodal underperformance stems specifically from 'coupled imbalance between data distribution and modality learning conflict' is presented as the result of 'detailed analysis,' yet no quantitative diagnostics (gradient conflict metrics, per-modality loss curves, or optimization landscape visualizations) are referenced to link the observed MSE gap to this mechanism rather than alternatives such as fusion architecture or hyperparameter effects.
Authors: Section 3 describes the empirical observation that multimodal fusion underperforms unimodal baselines and attributes this to the coupled imbalance after examining training dynamics and modality-specific contributions. While explicit quantitative metrics were not tabulated in the current version, the analysis is grounded in those observations. To address the concern directly, we will add gradient conflict metrics, per-modality loss curves, and optimization landscape visualizations to the revised §3. revision: yes
-
Referee: [§4] §4 (Experiments): The central 29% average MSE reduction claim is load-bearing, but the abstract and reported results provide no ablation studies isolating the contribution of adaptive-margin contrastive learning versus sharpness-aware modulation, no error bars or statistical significance tests, and no dataset statistics (class imbalance ratios, modality-specific sample counts) that would allow verification of the imbalance diagnosis.
Authors: We agree that isolating component contributions and providing statistical support are necessary. In the revised §4 we will add ablation studies separating the adaptive-margin contrastive learning from the sharpness-aware modulation, report error bars (standard deviation over multiple runs), include statistical significance tests, and supply dataset statistics such as modality-specific sample counts and imbalance ratios. revision: yes
-
Referee: [§4] §4: Reliance on a private clinical dataset without disclosed controls for data leakage, patient-level splitting, or external validation raises concerns about whether the reported gains generalize or are reproducible by the community.
Authors: Patient-level splitting was used on the private dataset to avoid leakage; we will explicitly document this procedure and all validation controls in the revised §4. The dataset cannot be released for privacy reasons, but results on the public dataset already support reproducibility. Additional external validation would require new data sources beyond the current experiments. revision: partial
Circularity Check
No significant circularity; central claim is empirical performance on held-out data.
full rationale
The paper presents an empirical method (Re-M3Dr) motivated by observed multimodal underperformance, with the key result being a measured 29% MSE reduction on public and private datasets. No derivation chain, equations, or first-principles predictions are provided that reduce by construction to fitted parameters, self-citations, or renamed inputs. The root-cause diagnosis is stated as arising from 'detailed analysis' but is not formalized mathematically or shown to be tautological. The performance claim remains falsifiable against external baselines and does not rely on load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Ophthalmology120(12), 2476–2484 (2013)
Arora, K.S., Boland, M.V., Friedman, D.S., Jefferys, J.L., West, S.K., Ramulu, P.Y.: The relationship between better-eye and integrated visual field mean devia- tion and visual disability. Ophthalmology120(12), 2476–2484 (2013)
2013
-
[2]
In: International conference on machine learning
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for con- trastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607. PmLR (2020)
2020
-
[3]
Sharpness-Aware Minimization for Efficiently Improving Generalization
Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B.: Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412 (2020) Abbreviated paper title 15
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[4]
In: The Twelfth International Conference on Learning Repre- sentations
Keramati, M., Meng, L., Evans, R.D.: Conr: Contrastive regularizer for deep im- balanced regression. In: The Twelfth International Conference on Learning Repre- sentations
-
[5]
In: Inter- national Conference on Learning Representations (2017)
Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large- batch training for deep learning: Generalization gap and sharp minima. In: Inter- national Conference on Learning Representations (2017)
2017
-
[6]
Scientific Reports15(1), 13395 (2025)
Koyama, M., Ueno, Y., Ito, Y., Oshika, T., Tanito, M.: Automated learning of glaucomatous visual fields from oct images using a comprehensive, segmentation- free 3d convolutional neural network model. Scientific Reports15(1), 13395 (2025)
2025
-
[7]
In: International Conference on Medical Image Computing and Computer-Assisted Intervention
Lai, S., Zhao, M., Zhao, Z., Chang, S., Yuan, X., Liu, H., Zhang, Q., Meng, G.: Echomen: Combating data imbalance in ejection fraction regression via multi- expert network. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 624–633. Springer (2024)
2024
-
[8]
In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV)
Li, Z., Xing, Z., Liu, H., Zhu, L., Wan, L.: Anchored supervised contrastive learn- ing for long-tailed medical image regression. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 3–18. Springer (2024)
2024
-
[9]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Lyu, X., Xu, Q., Yang, Z., Lyu, S., Huang, Q.: Sse-sam: Balancing head and tail classes gradually through stage-wise sam. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 19278–19286 (2025)
2025
-
[10]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Peng, X., Wei, Y., Deng, A., Wang, D., Hu, D.: Balanced multimodal learning via on-the-fly gradient modulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8238–8247 (2022)
2022
- [11]
-
[12]
Transactions on Machine Learning Research (2024)
Träuble, J., Hiscox, L.V., Johnson, C.L., Schönlieb, C.B., Kaminski Schierle, G.S., Aviles-Rivero, A.I.: Contrastive learning with adaptive neighborhoods for brain age prediction on 3d stiffness maps. Transactions on Machine Learning Research (2024)
2024
-
[13]
In: International Conference on Machine Learning
Wei, Y., Hu, D.: Mmpareto: Boosting multimodal learning with innocent unimodal assistance. In: International Conference on Machine Learning. pp. 52559–52572. PMLR (2024)
2024
-
[14]
Balancebench- mark: A survey for multimodal imbalance learning,
Xu, S., Cui, M., Huang, C., Wang, H., Hu, D.: Balancebenchmark: A survey for multimodal imbalance learning. arXiv preprint arXiv:2502.10816 (2025)
-
[15]
Ophthalmology Glaucoma4(1), 102–112 (2021)
Yu, H.H., Maetschke, S.R., Antony, B.J., Ishikawa, H., Wollstein, G., Schuman, J.S., Garnavi, R.: Estimating global visual field indices in glaucoma by combining macula and optic disc oct scans using 3-dimensional convolutional neural networks. Ophthalmology Glaucoma4(1), 102–112 (2021)
2021
-
[16]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Zhou, Y., Qu, Y., Xu, X., Shen, H.: Imbsam: A closer look at sharpness-aware minimization in class-imbalanced recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11345–11355 (2023)
2023
-
[17]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Zhou, Z., Li, L., Zhao, P., Heng, P.A., Gong, W.: Class-conditional sharpness-aware minimization for deep long-tailed recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3499–3509 (2023)
2023
-
[18]
In: International Conference on Machine Learning
Zhu, Z., Wu, J., Yu, B., Wu, L., Ma, J.: The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In: International Conference on Machine Learning. pp. 7654–7663. PMLR (2019)
2019
-
[19]
Medical Image Analysis96, 103214 (2024) 16 F
Zou, K., Lin, T., Han, Z., Wang, M., Yuan, X., Chen, H., Zhang, C., Shen, X., Fu, H.: Confidence-aware multi-modality learning for eye disease screening. Medical Image Analysis96, 103214 (2024) 16 F. Author et al
2024
-
[20]
In: Interna- tional Conference on Medical Image Computing and Computer-Assisted Interven- tion
Zou, K., Lin, T., Yuan, X., Chen, H., Shen, X., Wang, M., Fu, H.: Reliable multi- modality eye disease screening via mixture of student’st distributions. In: Interna- tional Conference on Medical Image Computing and Computer-Assisted Interven- tion. pp. 596–606. Springer (2023)
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.