Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models
Pith reviewed 2026-05-20 18:57 UTC · model grok-4.3
The pith
Second-order preconditioning with Fisher-Orthogonal Projection reduces modality competition in autoregressive multimodal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that cross-modality gradient heterogeneity is the primary source of modality competition in unified autoregressive multimodal models. By employing second-order preconditioning and introducing a Fisher-Orthogonal Projection to suppress variance-induced conflicts, combined with multi-level variance correction and hierarchical folding, the ML-FOP-SOAP optimizer achieves more stable alignment between visual and textual objectives.
What carries the argument
Fisher-Orthogonal Projection inside the ML-FOP-SOAP optimizer, which uses second-order information to project gradients and suppress variance-induced conflicts between modalities.
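The projection formula quoted in the Lean-theorem section of this page can be sketched numerically. The block below is a minimal illustration, not the authors' implementation. It assumes that `M` is a symmetric positive-definite matrix standing in for the operator M(·), that ϵ sits in the denominator, and that `g_avg` and `g_diff` are the mean and difference of two gradient estimates. None of these definitions is given in the excerpt, and all names are hypothetical.

```python
import numpy as np


def fisher_orthogonal_projection(g_diff, g_avg, M, eps=1e-12):
    """Remove from g_diff its component along g_avg under the metric M.

    M is a hypothetical symmetric positive-definite matrix standing in for
    the curvature/Fisher operator M(.) in the quoted formula.
    """
    m_g_avg = M @ g_avg                      # M(g_avg)
    numerator = g_diff @ m_g_avg             # <g_diff, M(g_avg)>
    denominator = g_avg @ m_g_avg + eps      # <g_avg, M(g_avg)> + eps
    return g_diff - (numerator / denominator) * g_avg


# Toy data: two gradient estimates (hypothetical image and text gradients).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
M = A @ A.T + np.eye(4)                      # symmetric positive definite
g_image, g_text = rng.normal(size=4), rng.normal(size=4)
g_avg = 0.5 * (g_image + g_text)             # assumed definition
g_diff = g_image - g_text                    # assumed definition

g_perp = fisher_orthogonal_projection(g_diff, g_avg, M)
residual = float(g_perp @ (M @ g_avg))       # M-inner product with g_avg
print(residual)                              # near zero up to eps and rounding
```

Under these assumptions the projected vector is orthogonal to `g_avg` in the M-weighted inner product, which is the sense in which the projection removes the component that duplicates the average direction.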
If this is right
- Consistent performance gains across both visual generation and textual understanding.
- Stable optimization at a batch size of 8192.
- Up to 1.4 times better sample efficiency than AdamW.
- Up to 1.5 times faster wall-clock training than AdamW.
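One way to read the two ratios above, assuming "sample efficiency" means the number of training samples needed to reach a fixed target metric and "wall-clock" means the time to reach it (the page does not state the exact definitions, and the symbols below are introduced only for illustration):

```latex
% N = samples to reach the target, T = wall-clock time to reach the target.
% Best-case figures, since the reported ratios are "up to".
\[
  N_{\text{ML-FOP-SOAP}} \;\ge\; \frac{N_{\text{AdamW}}}{1.4} \approx 0.71\, N_{\text{AdamW}},
  \qquad
  T_{\text{ML-FOP-SOAP}} \;\ge\; \frac{T_{\text{AdamW}}}{1.5} \approx 0.67\, T_{\text{AdamW}} .
\]
```

Under that reading, the best case corresponds to roughly 29% fewer samples and roughly 33% less training time than AdamW.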
Where Pith is reading between the lines
- Similar variance-correction projections might reduce conflicts when jointly training on other heterogeneous data types such as audio and video.
- The hierarchical folding approach could be tested for efficiency in other large-scale optimizers that rely on gradient accumulation.
- Examining gradient heterogeneity patterns in non-autoregressive multimodal architectures would test how specific the solution is to next-token prediction.
Load-bearing premise
Modality competition stems primarily from cross-modality gradient heterogeneity that second-order preconditioning and the Fisher-Orthogonal Projection can effectively address.
What would settle it
Running the same Janus or Emu3 experiments with the Fisher-Orthogonal Projection or multi-level variance correction disabled and finding no gain in sample efficiency or stability compared with AdamW would falsify the central claim.
Original abstract
Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to 1.4× and accelerates wall-clock training by up to 1.5×, offering a robust optimizer for scaling multimodal foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ML-FOP-SOAP, a second-order optimization framework for autoregressive multimodal models that integrates Multi-Level Variance Correction and a Fisher-Orthogonal Projection to mitigate cross-modality gradient heterogeneity. It claims that this approach stabilizes training at large batch sizes (8192) on Janus and Emu3, yielding up to 1.4× sample efficiency and 1.5× wall-clock speedups relative to AdamW by reducing trade-offs between visual generation and textual understanding.
Significance. If the performance gains prove reproducible and mechanistically tied to the proposed projection and variance correction, the work could supply a practical optimizer for large-batch scaling of multimodal foundation models, addressing a timely bottleneck in joint image-text training.
major comments (2)
- [Experiments] Experiments section: The manuscript reports gains versus AdamW but provides no ablation that retains the SOAP base optimizer while removing the Fisher-Orthogonal Projection (or the Multi-Level Variance Correction). Without this control, it is impossible to confirm that the 1.4× sample-efficiency improvement arises from suppression of cross-modality gradient conflicts rather than generic second-order preconditioning.
- [§3.2] §3.2 (hierarchical folding strategy): The central claim that the projection reduces modality competition would be strengthened by reporting inter-modality gradient cosine similarities or variance ratios before and after the projection; these diagnostics are absent from the presented results.
minor comments (1)
- [Abstract] The abstract introduces the hierarchical folding strategy without a forward reference to its equation or pseudocode; adding this would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have updated the manuscript to strengthen the experimental validation of our claims.
- Referee: [Experiments] Experiments section: The manuscript reports gains versus AdamW but provides no ablation that retains the SOAP base optimizer while removing the Fisher-Orthogonal Projection (or the Multi-Level Variance Correction). Without this control, it is impossible to confirm that the 1.4× sample-efficiency improvement arises from suppression of cross-modality gradient conflicts rather than generic second-order preconditioning.
  Authors: We agree that an ablation isolating the Fisher-Orthogonal Projection and Multi-Level Variance Correction from the base SOAP optimizer is necessary to attribute gains specifically to modality-competition mitigation. In the revised manuscript we add direct comparisons of ML-FOP-SOAP against plain SOAP on both Janus and Emu3, demonstrating additional sample-efficiency and stability improvements from the proposed components beyond generic second-order preconditioning.
  Revision: yes
- Referee: [§3.2] §3.2 (hierarchical folding strategy): The central claim that the projection reduces modality competition would be strengthened by reporting inter-modality gradient cosine similarities or variance ratios before and after the projection; these diagnostics are absent from the presented results.
  Authors: We thank the referee for this suggestion. The revised §3.2 and experiments section now include diagnostic plots of inter-modality gradient cosine similarities and variance ratios computed before versus after the Fisher-Orthogonal Projection. These metrics show a measurable reduction in cross-modality conflicts, providing direct mechanistic support for the projection's role in alleviating modality competition.
  Revision: yes
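The diagnostics discussed in this exchange, inter-modality gradient cosine similarity and a variance ratio, could be computed along the following lines. This is a sketch under stated assumptions: gradients are flattened NumPy vectors, and the "variance ratio" is taken to be the ratio of total per-coordinate variance across micro-batches, which is one plausible definition rather than the one used in the paper. The function names are hypothetical.

```python
import numpy as np


def cosine_similarity(g_a, g_b, eps=1e-12):
    """Cosine of the angle between two flattened gradient vectors."""
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b) + eps
    return float(np.dot(g_a, g_b) / denom)


def variance_ratio(grads_a, grads_b, eps=1e-12):
    """Ratio of total gradient variance between two modalities.

    grads_a and grads_b have shape (num_micro_batches, num_params).
    The variance is taken per coordinate across micro-batches, then summed.
    This definition is an assumption made for illustration.
    """
    var_a = np.var(grads_a, axis=0).sum()
    var_b = np.var(grads_b, axis=0).sum()
    return float(var_a / (var_b + eps))


# Toy example: opposing gradients give a strongly negative cosine.
g_image = np.array([1.0, 2.0])
g_text = np.array([-1.0, -2.0])
print(cosine_similarity(g_image, g_text))
```

A negative cosine between modality gradients indicates directly conflicting updates. Logging it before and after a projection step is the kind of comparison the referee requests.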
Circularity Check
Empirical optimizer proposal with no self-referential derivations or fitted inputs presented as predictions.
The paper proposes ML-FOP-SOAP as a practical second-order optimizer extension for multimodal training, grounded in observed gradient heterogeneity and validated through direct comparisons to AdamW on Janus and Emu3. No equations, uniqueness theorems, or parameter fits are described that reduce the central claims (e.g., Fisher-Orthogonal Projection suppressing modality conflicts) back to the inputs by construction. The framework builds on existing second-order methods like SOAP but introduces new components justified by empirical stability at large batch sizes rather than self-citation chains or ansatz smuggling. This is a standard empirical contribution whose performance claims stand or fall on the reported experiments, not on any circular reduction in the derivation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Fisher-Orthogonal Projection suppresses variance-induced modality conflicts... g⊥_diff = g_diff − [⟨g_diff, M(g_avg)⟩ / (⟨g_avg, M(g_avg)⟩ + ϵ)] g_avg
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Multi-level Hierarchical Gradient Folding... z_j = FOP_accum(z_{j-1}, ḡ_{L_{j-1}+1:L_j})
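The folding recurrence quoted above, z_j = FOP_accum(z_{j-1}, ḡ_{L_{j-1}+1:L_j}), can be illustrated structurally. The sketch below assumes that ḡ denotes the mean of the micro-step gradients in a segment and that the first segment mean initializes z. The excerpt does not define `FOP_accum`, so it is passed in as a callable. The `placeholder_accum` used in the demo is a plain average and is explicitly not the paper's operator.

```python
import numpy as np


def fold_gradients(micro_grads, boundaries, accum):
    """Fold segment-averaged micro-step gradients level by level.

    micro_grads: array of shape (num_micro_steps, num_params).
    boundaries:  increasing segment end indices L_1 < L_2 < ... (exclusive,
                 0-based), so segment j covers micro_grads[L_{j-1}:L_j].
    accum:       callable (z_prev, g_bar) -> z_next, standing in for the
                 undefined FOP_accum.
    """
    z = None
    start = 0
    for end in boundaries:
        g_bar = np.mean(micro_grads[start:end], axis=0)   # segment mean
        z = g_bar if z is None else accum(z, g_bar)       # assumed init: z_1 = g_bar
        start = end
    return z


def placeholder_accum(z_prev, g_bar):
    """Hypothetical stand-in: simple average, NOT the paper's FOP_accum."""
    return 0.5 * (z_prev + g_bar)


micro_grads = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 6.0]])
z = fold_gradients(micro_grads, boundaries=[2, 4], accum=placeholder_accum)
print(z)
```

Only one running vector `z` and one segment mean are live at a time in this structure, which is consistent with the abstract's description of low micro-step overhead, although the actual cost depends on what `FOP_accum` does.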
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Abreu, N., Vyas, N., Kakade, S., Morwani, D.: The potential of second-order optimization for llms: A study with full gauss-newton. arXiv preprint arXiv:2510.09378 (2025)
- [2] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond (2023), https://arxiv.org/abs/2308.12966
- [3] Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In: International conference on machine learning. pp. 794–803. PMLR (2018)
- [4] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems 36, 49250–49267 (2023)
- [5] Giles, M.B.: Multilevel monte carlo methods. Acta numerica 24, 259–328 (2015)
- [6] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J.W., Vinyals, O., Sifre, L.: Training compute-optimal large language models (2022), https:/...
- [7] Kontras, K., Chatzichristos, C., Blaschko, M., De Vos, M.: Improving multimodal learning with multi-loss gradient modulation. arXiv preprint arXiv:2405.07930 (2024)
- [8] Liang, V.W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.Y.: Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems 35, 17612–17625 (2022)
- [9] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36, 34892–34916 (2023)
- [10]
- [11] lmms-lab: Llava-recap-cc12m (2024), https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-CC12M, Hugging Face dataset, accessed 2026-03-05
- [12] lmms-lab: Llava-recap-cc3m (2024), https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-CC3M, Hugging Face dataset, accessed 2026-03-05
- [13] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [14] Lu, Y., Armour, W.: Beyond the mean: Fisher-orthogonal projection for natural gradient descent in large batch training. arXiv preprint arXiv:2508.13898 (2025)
- [15] Martens, J., Grosse, R.: Optimizing neural networks with kronecker-factored approximate curvature. In: International conference on machine learning. pp. 2408–
- [16] van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning (2018), https://arxiv.org/abs/1711.00937
- [17] Pascanu, R., Bengio, Y.: Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584 (2013)
- [18] Peng, X., Wei, Y., Deng, A., Wang, D., Hu, D.: Balanced multimodal learning via on-the-fly gradient modulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8238–8247 (2022)
- [19] Qian, C., Han, K., Liu, J., Yuan, Z., Zhu, Z., Wang, J., Lyu, C., Chen, J., Liu, Z.: Dyncim: Dynamic curriculum for imbalanced multimodal learning. arXiv preprint arXiv:2503.06456 (2025)
- [20] Shi, H.J.M., Lee, T.H., Iwasaki, S., Gallego-Posada, J., Li, Z., Rangadurai, K., Mudigere, D., Rabbat, M.: A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497 (2023)
- [21] Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)
- [22] Vyas, N., Morwani, D., Zhao, R., Kwun, M., Shapira, I., Brandfonbrener, D., Janson, L., Kakade, S.: Soap: Improving and stabilizing shampoo using adam. arXiv preprint arXiv:2409.11321 (2024)
- [23]
- [24] Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. Advances in neural information processing systems 30 (2017)
- [25] Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12966–12977 (2025) [paper running header, page 16: Yishun Lu and Wes Armour]
- [26] Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C.: Gradient surgery for multi-task learning. Advances in neural information processing systems 33, 5824–5836 (2020)
[27]
Sigmoid Loss for Language Image Pre-Training
Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training (2023),https://arxiv.org/abs/2303.15343 Abbreviated paper title 17 Supplementary Material for Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models A Theoretical Analysis of Fisher Information for Modality Balancing In this a...
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [28] (29) This shows that, locally, the Fisher information matrix defines the intrinsic metric of the statistical manifold.
A.3 Why Can F^{−1} Mitigate Modality Competition?
The key observation follows from the information matrix equality (also known as Bartlett's identity). For log-likelihood objectives, F = −E_{p_θ}[∇² log p(z|θ)] = E_{p_θ}[∇ log...
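The information matrix equality named in this fragment is a standard result. A short derivation is sketched below in the fragment's notation, assuming the usual regularity conditions that allow differentiation under the integral sign. It is the textbook argument and is not reproduced from the paper's appendix.

```latex
% Step 1: normalization implies the score has zero mean.
\[
  \int p(z \mid \theta)\, dz = 1
  \;\Longrightarrow\;
  \mathbb{E}_{p_\theta}\!\left[ \nabla \log p(z \mid \theta) \right]
  = \int \nabla p(z \mid \theta)\, dz = 0 .
\]
% Step 2: differentiate the zero-mean identity once more.
\[
  0 = \nabla \int p(z \mid \theta)\, \nabla \log p(z \mid \theta)^{\top} dz
    = \mathbb{E}_{p_\theta}\!\left[ \nabla \log p \, \nabla \log p^{\top} \right]
    + \mathbb{E}_{p_\theta}\!\left[ \nabla^{2} \log p \right] .
\]
% Step 3: rearrange to obtain the two equivalent forms of the Fisher matrix.
\[
  F = \mathbb{E}_{p_\theta}\!\left[ \nabla \log p(z \mid \theta)\, \nabla \log p(z \mid \theta)^{\top} \right]
    = -\,\mathbb{E}_{p_\theta}\!\left[ \nabla^{2} \log p(z \mid \theta) \right] .
\]
```

The second form identifies F with the expected curvature of the negative log-likelihood. That identity is the usual justification for treating F^{−1} as a curvature-aware preconditioner.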