arxiv: 2605.26513 · v1 · pith:Q5TIZWTSnew · submitted 2026-05-26 · 💻 cs.CV

Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression

Pith reviewed 2026-06-29 18:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal regression · mean deviation prediction · ophthalmic imaging · contrastive learning · sharpness-aware optimization · OCT and fundus photography · medical image fusion · visual field loss

The pith

Multimodal fusion for mean deviation prediction performs worse than single modalities until a rebalancing fix is applied.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that combining OCT and fundus photography images for predicting mean deviation in visual field loss produces higher error than using either modality alone. The authors trace the failure to a coupled imbalance between uneven data distributions and conflicting signals from the two imaging types that destabilizes training. They introduce Re-M3Dr, which first improves each modality's representation through adaptive-margin supervised contrastive learning and then stabilizes joint training with sharpness-aware gradient modulation. On public and private clinical datasets this yields an average 29 percent drop in mean squared error relative to prior multimodal regression methods. A reader would care because the result challenges the common assumption that more medical images always help and offers a concrete way to make complementary scans useful for diagnosis.

Core claim

The central claim is that multimodal regression for mean deviation fails because a coupled imbalance between data distribution and modality learning conflict distorts the optimization landscape and produces unstable training; Re-M3Dr corrects this by strengthening unimodal representations with adaptive-margin supervised contrastive learning and then applying sharpness-aware gradient modulation to the joint objective, delivering an average 29 percent MSE reduction versus state-of-the-art multimodal baselines on both public and private clinical datasets.

What carries the argument

The Re-M3Dr framework, which first applies adaptive-margin supervised contrastive learning to each modality separately and then uses sharpness-aware gradient modulation during joint multimodal optimization.
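
The first stage can be illustrated with a minimal sketch of how an adaptive margin might select positives, following the Figure 3 caption (positives lie within a label-distance margin that shrinks during training). The exponential schedule, the normalisation of MD labels to [0, 1], and the function names are assumptions made for illustration; only the initial margin 0.4 and decay rate 0.0005 appear in the paper's sensitivity analysis, and the paper's actual schedule may differ.

```python
import numpy as np


def adaptive_margin(step, initial_margin=0.4, decay_rate=0.0005):
    """Hypothetical schedule: the margin decays exponentially with the step."""
    return initial_margin * np.exp(-decay_rate * step)


def positive_mask(labels, margin):
    """Mark pairs (i, j), i != j, whose regression labels differ by <= margin."""
    labels = np.asarray(labels, dtype=float)
    dist = np.abs(labels[:, None] - labels[None, :])
    mask = dist <= margin
    np.fill_diagonal(mask, False)  # a sample is never its own positive
    return mask


# Assumed normalised labels for z1..z4; z2 (index 1) is the anchor.
labels = [0.10, 0.15, 0.45, 0.95]
early = positive_mask(labels, adaptive_margin(step=0))
late = positive_mask(labels, adaptive_margin(step=2000))
print("early positives of z2:", np.flatnonzero(early[1]))
print("late positives of z2:", np.flatnonzero(late[1]))
```

With these assumed labels, z1 and z3 start as positives for z2 and z3 drops out once the margin has shrunk, which mirrors the behaviour described for Figure 3. The resulting mask would then feed a supervised contrastive loss, which is omitted here.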

If this is right

  • Correcting the identified imbalance allows multimodal models to outperform both unimodal and prior multimodal approaches for mean deviation regression.
  • Adaptive-margin contrastive learning improves the quality of individual modality features before fusion.
  • Sharpness-aware gradient modulation reduces training instability in multimodal regression settings.
  • The performance gain holds across both public benchmark and private clinical collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar distribution-versus-conflict imbalances may appear in other medical regression tasks that fuse complementary but unequally sampled image types.
  • The same two-stage rebalancing could be tested on non-ophthalmic multimodal regression problems where one modality dominates the data distribution.
  • Ablation experiments could isolate whether the contrastive pre-step or the modulation step contributes more to the observed MSE drop.

Load-bearing premise

The root cause is correctly identified as the coupled imbalance between data distribution and modality learning conflict, and the two proposed components fix it without introducing new instabilities or requiring dataset-specific tuning.

What would settle it

An experiment on the same public and private datasets in which Re-M3Dr still yields higher MSE than the best unimodal baseline after the contrastive and modulation steps are applied.

Figures

Figures reproduced from arXiv: 2605.26513 by Chengcheng Feng, Haojie Yin, Kaizhu Huang, Tianqi Zhang, Tianyi Liu.

Figure 1: Mean Squared Error on MD regression datasets.
Figure 2: (a) Loss changes during unimodal training; (b) and (c) loss changes corresponding to the unimodal head under MMPareto multimodal joint learning. The figure shows three sets: red represents the loss changes when training with all data, blue represents training with only head-class data, and green represents training with only tail-class data. It can be observed that the outputs of unimodal prediction heads with…
Figure 3: Illustration of adaptive margin in supervised contrastive learning. P: positive; N: negative. Early in training, z1 and z3 are positives for z2, while z4 is negative. As training progresses, the margin shrinks, and distant positive z3 becomes negative.
Figure 4: Illustration of SGM. For example, given the multimodal and OCT loss, the network computes the Pareto-optimal direction via MOO. MMPareto amplifies the combined gradient to compensate for the lower noise in the joint training. But this blind amplification may prevent convergence to flat minima. SGM adjusts the modulation scale based on loss geometry, enabling both escape from sharp minima and stable converg…
Figure 5: Figure A shows how gamma and sharpness change over iterations. Figure B illustrates the relationship (correlation) between the two.
Figure 6: Hyperparameter sensitivity analysis of AM and SGM. From Section 5.6 (Hyperparameter Sensitivity): We analyze the sensitivity of key hyperparameters in AM and SGM. For AM, we vary the initial margin while fixing the decay rate to 0.0005 (Fig. 6a), and vary the decay rate while fixing the initial margin to 0.4 (Fig. 6b). For SGM, we evaluate the scaling factor γ ∈ {0.5, 1, 1.5, 2, 5, 10} (Fig. 6c). The results show that th…
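
The Figure 4 caption mentions a Pareto-optimal direction computed via MOO and a modulation scale that depends on loss geometry. As a rough sketch of the two ingredients, the block below uses the standard closed-form min-norm combination of two gradients (the two-objective case of MGDA) and a SAM-style sharpness proxy (loss increase after a small step along the normalised gradient). Both are stand-ins chosen for illustration: the function names and the radius `rho` are hypothetical, and how SGM maps sharpness to a scale is not specified on this page, so that step is left out.

```python
import numpy as np


def min_norm_direction(g_multi, g_uni):
    """Min-norm point in the convex hull of two gradients (two-objective MGDA)."""
    diff = g_multi - g_uni
    denom = float(diff @ diff)
    if denom == 0.0:  # identical gradients: every convex combination is the same
        return g_multi.copy()
    alpha = float(np.clip((g_uni - g_multi) @ g_uni / denom, 0.0, 1.0))
    return alpha * g_multi + (1.0 - alpha) * g_uni


def sam_sharpness(loss_fn, grad, w, rho=0.05):
    """SAM-style proxy: loss increase after a step of length rho along the gradient."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return 0.0
    return float(loss_fn(w + rho * grad / norm) - loss_fn(w))


# Toy example with two partially conflicting gradients.
g_multi = np.array([1.0, 0.0])
g_uni = np.array([-1.0, 1.0])
direction = min_norm_direction(g_multi, g_uni)
print("common direction:", direction)

# Toy quadratic losses: the sharper bowl yields the larger proxy value.
w = np.array([1.0, 1.0])
print("sharp:", sam_sharpness(lambda v: 5.0 * float(v @ v), 10.0 * w, w))
print("flat:", sam_sharpness(lambda v: 0.05 * float(v @ v), 0.1 * w, w))
```

The min-norm direction has a non-negative inner product with both input gradients, which is why it is a common choice for a shared update when two objectives disagree.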
Abstract

Mean Deviation (MD) is a critical metric for assessing visual field loss in ophthalmology. While previous work has focused solely on predicting MD from Optical Coherence Tomography (OCT), it is intuitive to assume that combining OCT with fundus photography (FP), another imaging modality, could improve performance, as the two ophthalmic imaging modalities provide complementary information. This is particularly expected when sophisticated multi-objective optimization is applied, as documented in common multimodal classification. Surprisingly, our investigations reveal that multimodal fusion in this medical imaging scenario performs worse than unimodal models. Through detailed analysis, we identify the root cause as a coupled imbalance between data distribution and modality learning conflict. This imbalance distorts the optimization landscape, leading to unstable training. To address this challenge, we propose Rebalanced MultiModal Mean Deviation Regression (Re-M3Dr), a novel multimodal regression framework. We enhance unimodal representation through adaptive-margin-based supervised contrastive learning. Then, our framework stabilizes the joint optimization with sharpness-aware gradient modulation. Experimental results on both public and private clinical datasets show an average 29% reduction in MSE compared to SOTA multimodal learning methods, demonstrating the superiority of Re-M3Dr. The code is available in the supplementary materials.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that multimodal fusion of OCT and fundus photography for Mean Deviation (MD) regression underperforms unimodal baselines due to a coupled imbalance between data distribution and modality learning conflict that distorts the optimization landscape. It proposes Re-M3Dr, which applies adaptive-margin supervised contrastive learning to enhance unimodal representations followed by sharpness-aware gradient modulation to stabilize joint training, and reports an average 29% MSE reduction versus SOTA multimodal methods on both public and private clinical datasets.

Significance. If the performance gain is shown to arise specifically from the proposed rebalancing rather than generic regularization, the work would offer a practical framework for multimodal regression in ophthalmology and similar medical imaging domains where modality imbalance is common. Reproducibility is aided by the stated code availability.

major comments (3)
  1. [Abstract, §3] Abstract and §3: The diagnosis that multimodal underperformance stems specifically from 'coupled imbalance between data distribution and modality learning conflict' is presented as the result of 'detailed analysis,' yet no quantitative diagnostics (gradient conflict metrics, per-modality loss curves, or optimization landscape visualizations) are referenced to link the observed MSE gap to this mechanism rather than alternatives such as fusion architecture or hyperparameter effects.
  2. [§4] §4 (Experiments): The central 29% average MSE reduction claim is load-bearing, but the abstract and reported results provide no ablation studies isolating the contribution of adaptive-margin contrastive learning versus sharpness-aware modulation, no error bars or statistical significance tests, and no dataset statistics (class imbalance ratios, modality-specific sample counts) that would allow verification of the imbalance diagnosis.
  3. [§4] §4: Reliance on a private clinical dataset without disclosed controls for data leakage, patient-level splitting, or external validation raises concerns about whether the reported gains generalize or are reproducible by the community.
minor comments (2)
  1. [§3] Notation for the adaptive margin and sharpness-aware terms should be defined explicitly with equations in §3 to allow direct comparison with prior contrastive and SAM literature.
  2. [Abstract] The abstract states 'the code is available in the supplementary materials' but does not specify the exact repository or commit; this should be clarified for reproducibility.
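
One of the diagnostics requested in major comment 1, a gradient conflict metric, is often computed as the cosine similarity between the gradients that two objectives induce on shared parameters. The sketch below is a generic illustration of that idea, not the authors' analysis; the function names and the toy gradients are hypothetical.

```python
import numpy as np


def gradient_cosine(g_a, g_b, eps=1e-12):
    """Cosine similarity between two flattened gradients; negative means conflict."""
    g_a = np.ravel(g_a).astype(float)
    g_b = np.ravel(g_b).astype(float)
    return float(g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b) + eps))


def conflict_rate(grads_a, grads_b):
    """Fraction of training steps at which the two gradients point apart."""
    cosines = [gradient_cosine(a, b) for a, b in zip(grads_a, grads_b)]
    return sum(c < 0 for c in cosines) / len(cosines)


# Toy per-step gradients for two modalities (values are made up).
oct_grads = [[1.0, 0.0], [1.0, 0.0]]
fp_grads = [[1.0, 1.0], [-1.0, 0.0]]
print("conflict rate:", conflict_rate(oct_grads, fp_grads))
```

Tracking this rate separately for head and tail samples would be one way to test whether the conflict is coupled to the data distribution, as the paper's diagnosis suggests.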

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying the root-cause analysis, strengthening the experimental section, and addressing reproducibility. We respond to each major comment below and indicate planned revisions.

  1. Referee: [Abstract, §3] Abstract and §3: The diagnosis that multimodal underperformance stems specifically from 'coupled imbalance between data distribution and modality learning conflict' is presented as the result of 'detailed analysis,' yet no quantitative diagnostics (gradient conflict metrics, per-modality loss curves, or optimization landscape visualizations) are referenced to link the observed MSE gap to this mechanism rather than alternatives such as fusion architecture or hyperparameter effects.

    Authors: Section 3 describes the empirical observation that multimodal fusion underperforms unimodal baselines and attributes this to the coupled imbalance after examining training dynamics and modality-specific contributions. While explicit quantitative metrics were not tabulated in the current version, the analysis is grounded in those observations. To address the concern directly, we will add gradient conflict metrics, per-modality loss curves, and optimization landscape visualizations to the revised §3. revision: yes

  2. Referee: [§4] §4 (Experiments): The central 29% average MSE reduction claim is load-bearing, but the abstract and reported results provide no ablation studies isolating the contribution of adaptive-margin contrastive learning versus sharpness-aware modulation, no error bars or statistical significance tests, and no dataset statistics (class imbalance ratios, modality-specific sample counts) that would allow verification of the imbalance diagnosis.

    Authors: We agree that isolating component contributions and providing statistical support are necessary. In the revised §4 we will add ablation studies separating the adaptive-margin contrastive learning from the sharpness-aware modulation, report error bars (standard deviation over multiple runs), include statistical significance tests, and supply dataset statistics such as modality-specific sample counts and imbalance ratios. revision: yes

  3. Referee: [§4] §4: Reliance on a private clinical dataset without disclosed controls for data leakage, patient-level splitting, or external validation raises concerns about whether the reported gains generalize or are reproducible by the community.

    Authors: Patient-level splitting was used on the private dataset to avoid leakage; we will explicitly document this procedure and all validation controls in the revised §4. The dataset cannot be released for privacy reasons, but results on the public dataset already support reproducibility. Additional external validation would require new data sources beyond the current experiments. revision: partial
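
The patient-level splitting mentioned in response 3 can be sketched as follows, assuming each record carries a patient identifier. The record layout, function name, and default test fraction are hypothetical and do not describe the authors' pipeline; the point is only that all samples from one patient end up on the same side of the split.

```python
import random
from collections import defaultdict


def patient_level_split(records, test_fraction=0.2, seed=0):
    """Split (patient_id, sample) pairs so that no patient appears in both sets."""
    by_patient = defaultdict(list)
    for patient_id, sample in records:
        by_patient[patient_id].append(sample)

    patients = sorted(by_patient)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))

    test = [(p, s) for p in patients[:n_test] for s in by_patient[p]]
    train = [(p, s) for p in patients[n_test:] for s in by_patient[p]]
    return train, test


# Toy records: several eyes or visits may belong to the same patient.
records = [("p1", "a"), ("p1", "b"), ("p2", "c"), ("p3", "d"),
           ("p3", "e"), ("p4", "f"), ("p5", "g")]
train, test = patient_level_split(records)
print(len(train), "train samples,", len(test), "test samples")
```

Splitting by sample instead of by patient would let two scans of the same eye land in both sets, which is the leakage the referee asks about.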

Circularity Check

0 steps flagged

No significant circularity; central claim is empirical performance on held-out data.

The paper presents an empirical method (Re-M3Dr) motivated by observed multimodal underperformance, with the key result being a measured 29% MSE reduction on public and private datasets. No derivation chain, equations, or first-principles predictions are provided that reduce by construction to fitted parameters, self-citations, or renamed inputs. The root-cause diagnosis is stated as arising from 'detailed analysis' but is not formalized mathematically or shown to be tautological. The performance claim remains falsifiable against external baselines and does not rely on load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities. The method implicitly relies on standard supervised contrastive and sharpness-aware optimization assumptions common to deep learning, none of which are enumerated or justified in the provided text.

pith-pipeline@v0.9.1-grok · 5751 in / 1359 out tokens · 27473 ms · 2026-06-29T18:03:53.946909+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 3 canonical work pages · 1 internal anchor

  1. Arora, K.S., Boland, M.V., Friedman, D.S., Jefferys, J.L., West, S.K., Ramulu, P.Y.: The relationship between better-eye and integrated visual field mean deviation and visual disability. Ophthalmology 120(12), 2476–2484 (2013)
  2. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. pp. 1597–1607. PMLR (2020)
  3. Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B.: Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412 (2020)
  4. Keramati, M., Meng, L., Evans, R.D.: Conr: Contrastive regularizer for deep imbalanced regression. In: The Twelfth International Conference on Learning Representations
  5. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: Generalization gap and sharp minima. In: International Conference on Learning Representations (2017)
  6. Koyama, M., Ueno, Y., Ito, Y., Oshika, T., Tanito, M.: Automated learning of glaucomatous visual fields from oct images using a comprehensive, segmentation-free 3d convolutional neural network model. Scientific Reports 15(1), 13395 (2025)
  7. Lai, S., Zhao, M., Zhao, Z., Chang, S., Yuan, X., Liu, H., Zhang, Q., Meng, G.: Echomen: Combating data imbalance in ejection fraction regression via multi-expert network. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 624–633. Springer (2024)
  8. Li, Z., Xing, Z., Liu, H., Zhu, L., Wan, L.: Anchored supervised contrastive learning for long-tailed medical image regression. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 3–18. Springer (2024)
  9. Lyu, X., Xu, Q., Yang, Z., Lyu, S., Huang, Q.: Sse-sam: Balancing head and tail classes gradually through stage-wise sam. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 19278–19286 (2025)
  10. Peng, X., Wei, Y., Deng, A., Wang, D., Hu, D.: Balanced multimodal learning via on-the-fly gradient modulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8238–8247 (2022)
  11. Phan, H., Tran, L., Tran, N.N., Ho, N., Phung, D., Le, T.: Improving multi-task learning via seeking task-based flat regions. arXiv preprint arXiv:2211.13723 (2022)
  12. Träuble, J., Hiscox, L.V., Johnson, C.L., Schönlieb, C.B., Kaminski Schierle, G.S., Aviles-Rivero, A.I.: Contrastive learning with adaptive neighborhoods for brain age prediction on 3d stiffness maps. Transactions on Machine Learning Research (2024)
  13. Wei, Y., Hu, D.: Mmpareto: Boosting multimodal learning with innocent unimodal assistance. In: International Conference on Machine Learning. pp. 52559–52572. PMLR (2024)
  14. Xu, S., Cui, M., Huang, C., Wang, H., Hu, D.: Balancebenchmark: A survey for multimodal imbalance learning. arXiv preprint arXiv:2502.10816 (2025)
  15. Yu, H.H., Maetschke, S.R., Antony, B.J., Ishikawa, H., Wollstein, G., Schuman, J.S., Garnavi, R.: Estimating global visual field indices in glaucoma by combining macula and optic disc oct scans using 3-dimensional convolutional neural networks. Ophthalmology Glaucoma 4(1), 102–112 (2021)
  16. Zhou, Y., Qu, Y., Xu, X., Shen, H.: Imbsam: A closer look at sharpness-aware minimization in class-imbalanced recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11345–11355 (2023)
  17. Zhou, Z., Li, L., Zhao, P., Heng, P.A., Gong, W.: Class-conditional sharpness-aware minimization for deep long-tailed recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3499–3509 (2023)
  18. Zhu, Z., Wu, J., Yu, B., Wu, L., Ma, J.: The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In: International Conference on Machine Learning. pp. 7654–7663. PMLR (2019)
  19. Zou, K., Lin, T., Han, Z., Wang, M., Yuan, X., Chen, H., Zhang, C., Shen, X., Fu, H.: Confidence-aware multi-modality learning for eye disease screening. Medical Image Analysis 96, 103214 (2024)
  20. Zou, K., Lin, T., Yuan, X., Chen, H., Shen, X., Wang, M., Fu, H.: Reliable multi-modality eye disease screening via mixture of student’s t distributions. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 596–606. Springer (2023)