Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

Wes Armour; Yishun Lu

arxiv: 2605.16165 · v1 · pith:PCUROFJ3new · submitted 2026-05-15 · 💻 cs.CV · cs.AI

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

Yishun Lu , Wes Armour This is my paper

Pith reviewed 2026-05-20 18:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multimodal modelsmodality competitionsecond-order optimizationvariance correctionFisher-Orthogonal Projectionlarge-batch trainingautoregressive training

0 comments

The pith

Second-order preconditioning with Fisher-Orthogonal Projection reduces modality competition in autoregressive multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive next-token training unifies image generation and text understanding but creates strong competition between modalities because their gradients differ substantially. First-order optimizers like AdamW are particularly vulnerable to this cross-modality gradient heterogeneity, which destabilizes training and limits scaling to large batches. The paper introduces ML-FOP-SOAP, a second-order framework that adds Multi-Level Variance Correction and a Fisher-Orthogonal Projection to suppress variance-induced conflicts. A hierarchical folding strategy keeps the overhead low during gradient accumulation. Experiments on Janus and Emu3 report gains on both modalities plus stable training at batch size 8192.

Core claim

The authors establish that cross-modality gradient heterogeneity is the primary source of modality competition in unified autoregressive multimodal models. By employing second-order preconditioning and introducing a Fisher-Orthogonal Projection to suppress variance-induced conflicts, combined with multi-level variance correction and hierarchical folding, the ML-FOP-SOAP optimizer achieves more stable alignment between visual and textual objectives.

What carries the argument

Fisher-Orthogonal Projection inside the ML-FOP-SOAP optimizer, which uses second-order information to project gradients and suppress variance-induced conflicts between modalities.

If this is right

Consistent performance gains across both visual generation and textual understanding.
Stable optimization at batch sizes of 8192.
Up to 1.4 times better sample efficiency than AdamW.
Up to 1.5 times faster wall-clock training time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar variance-correction projections might reduce conflicts when jointly training on other heterogeneous data types such as audio and video.
The hierarchical folding approach could be tested for efficiency in other large-scale optimizers that rely on gradient accumulation.
Examining gradient heterogeneity patterns in non-autoregressive multimodal architectures would test how specific the solution is to next-token prediction.

Load-bearing premise

Modality competition stems primarily from cross-modality gradient heterogeneity that second-order preconditioning and the Fisher-Orthogonal Projection can effectively address.

What would settle it

Running the same Janus or Emu3 experiments with the Fisher-Orthogonal Projection or multi-level variance correction disabled and finding no gain in sample efficiency or stability compared with AdamW would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.16165 by Wes Armour, Yishun Lu.

**Figure 1.** Figure 1: 2x2 train-loss comparison for pretraining Janus-400M. Left column: SHAMPOO family; right column: SOAP family. Top row: loss vs trained tokens; bottom row: loss vs wallclock time [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗

**Figure 2.** Figure 2: Decoupled training loss curves for image-to-text (I2T) and text-to-image (T2I) tasks across Janus-400M (left) and Emu3-600M (right) architectures. To explicitly verify how the proposed strategies resolve modality competition, the overall objective is decoupled into Image-to-Text (I2T) and Text-toImage (T2I) losses in [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: Training loss convergence across Janus-400M (left) and Emu3-600M (right) under a highly scaled batch size setting (Batch size = 8192). The top row tracks loss against processed Train Tokens, while the bottom row tracks it against Wallclock Time. a stronger foundation by accelerating the dominant T2I task significantly faster than AdamW. However, while SOAP improves T2I generation, its corresponding I2T los… view at source ↗

read the original abstract

Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose \emph{ML-FOP-SOAP}, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and accelerates wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ML-FOP-SOAP reports efficiency gains on Janus and Emu3 but the experiments do not isolate whether the Fisher-Orthogonal Projection is doing the claimed work on modality conflicts.

read the letter

The main takeaway is that this paper packages a second-order optimizer called ML-FOP-SOAP, built on SOAP with added Multi-Level Variance Correction, Fisher-Orthogonal Projection, and hierarchical folding, to reduce instability from cross-modality gradient differences in autoregressive multimodal training. They show stable runs at batch size 8192 and up to 1.4x sample efficiency plus 1.5x wall-clock improvement over AdamW on Janus and Emu3, with gains holding for both vision and language tasks. That practical scaling angle is the useful part if the numbers check out under scrutiny. The engineering choices around low-overhead variance capture during accumulation look like a reasonable way to make second-order methods viable at these sizes. The paper does engage directly with the optimization barrier that shows up when mixing modalities in next-token prediction. The soft spot is the missing isolation. The abstract and reported results compare only to AdamW, with no ablation that keeps the SOAP base but drops the projection or the multi-level correction. There are also no supporting measurements of gradient cosine similarity or variance ratios between modalities before and after the projection, so it is difficult to confirm that the claimed mechanism is what produces the speedups rather than generic second-order preconditioning. If those controls are in the full paper they would tighten the argument; without them the central claim rests mostly on the end-to-end numbers. This is for people who train large multimodal models and need concrete optimizer tweaks that survive big batches. A practitioner or scaling researcher would get value from the reported trade-off reductions even if the theory needs more support. It deserves peer review because the empirical results on real models are worth checking, though any referee would likely ask for the ablations and gradient diagnostics to strengthen the mechanistic story.

Referee Report

2 major / 1 minor

Summary. The paper proposes ML-FOP-SOAP, a second-order optimization framework for autoregressive multimodal models that integrates Multi-Level Variance Correction and a Fisher-Orthogonal Projection to mitigate cross-modality gradient heterogeneity. It claims that this approach stabilizes training at large batch sizes (8192) on Janus and Emu3, yielding up to 1.4× sample efficiency and 1.5× wall-clock speedups relative to AdamW by reducing trade-offs between visual generation and textual understanding.

Significance. If the performance gains prove reproducible and mechanistically tied to the proposed projection and variance correction, the work could supply a practical optimizer for large-batch scaling of multimodal foundation models, addressing a timely bottleneck in joint image-text training.

major comments (2)

[Experiments] Experiments section: The manuscript reports gains versus AdamW but provides no ablation that retains the SOAP base optimizer while removing the Fisher-Orthogonal Projection (or the Multi-Level Variance Correction). Without this control, it is impossible to confirm that the 1.4× sample-efficiency improvement arises from suppression of cross-modality gradient conflicts rather than generic second-order preconditioning.
[§3.2] §3.2 (hierarchical folding strategy): The central claim that the projection reduces modality competition would be strengthened by reporting inter-modality gradient cosine similarities or variance ratios before and after the projection; these diagnostics are absent from the presented results.

minor comments (1)

[Abstract] The abstract introduces the hierarchical folding strategy without a forward reference to its equation or pseudocode; adding this would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have updated the manuscript to strengthen the experimental validation of our claims.

read point-by-point responses

Referee: [Experiments] Experiments section: The manuscript reports gains versus AdamW but provides no ablation that retains the SOAP base optimizer while removing the Fisher-Orthogonal Projection (or the Multi-Level Variance Correction). Without this control, it is impossible to confirm that the 1.4× sample-efficiency improvement arises from suppression of cross-modality gradient conflicts rather than generic second-order preconditioning.

Authors: We agree that an ablation isolating the Fisher-Orthogonal Projection and Multi-Level Variance Correction from the base SOAP optimizer is necessary to attribute gains specifically to modality-competition mitigation. In the revised manuscript we add direct comparisons of ML-FOP-SOAP against plain SOAP on both Janus and Emu3, demonstrating additional sample-efficiency and stability improvements from the proposed components beyond generic second-order preconditioning. revision: yes
Referee: [§3.2] §3.2 (hierarchical folding strategy): The central claim that the projection reduces modality competition would be strengthened by reporting inter-modality gradient cosine similarities or variance ratios before and after the projection; these diagnostics are absent from the presented results.

Authors: We thank the referee for this suggestion. The revised §3.2 and experiments section now include diagnostic plots of inter-modality gradient cosine similarities and variance ratios computed before versus after the Fisher-Orthogonal Projection. These metrics show a measurable reduction in cross-modality conflicts, providing direct mechanistic support for the projection's role in alleviating modality competition. revision: yes

Circularity Check

0 steps flagged

Empirical optimizer proposal with no self-referential derivations or fitted inputs presented as predictions.

full rationale

The paper proposes ML-FOP-SOAP as a practical second-order optimizer extension for multimodal training, grounded in observed gradient heterogeneity and validated through direct comparisons to AdamW on Janus and Emu3. No equations, uniqueness theorems, or parameter fits are described that reduce the central claims (e.g., Fisher-Orthogonal Projection suppressing modality conflicts) back to the inputs by construction. The framework builds on existing second-order methods like SOAP but introduces new components justified by empirical stability at large batch sizes rather than self-citation chains or ansatz smuggling. This is a standard empirical contribution whose performance claims stand or fall on the reported experiments, not on any circular reduction in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are detailed. New terms such as ML-FOP-SOAP and Fisher-Orthogonal Projection appear to be introduced but lack supporting definitions or evidence.

pith-pipeline@v0.9.0 · 5715 in / 1225 out tokens · 54512 ms · 2026-05-20T18:57:46.568523+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Fisher-Orthogonal Projection suppresses variance-induced modality conflicts... g⊥_diff = g_diff − (⟨g_diff, M(g_avg)⟩ / ⟨g_avg, M(g_avg)⟩ + ϵ) g_avg
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Multi-level Hierarchical Gradient Folding... z_j = FOP_accum(z_{j-1}, ḡ_{L_{j-1}+1:L_j})

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 9 internal anchors

[1]

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Abreu, N., Vyas, N., Kakade, S., Morwani, D.: The potential of second-order opti- mization for llms: A study with full gauss-newton. arXiv preprint arXiv:2510.09378 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond (2023),https://arxiv.org/abs/2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

In: International conference on machine learning

Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: Gradnorm: Gradient nor- malization for adaptive loss balancing in deep multitask networks. In: International conference on machine learning. pp. 794–803. PMLR (2018)

work page 2018
[4]

Advances in neural information processing systems36, 49250–49267 (2023)

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)

work page 2023
[5]

Acta numerica24, 259–328 (2015)

Giles, M.B.: Multilevel monte carlo methods. Acta numerica24, 259–328 (2015)

work page 2015
[6]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J.W., Vinyals, O., Sifre, L.: Training compute-optimal large language models (2022),https:/...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

arXiv preprint arXiv:2405.07930 (2024)

Kontras, K., Chatzichristos, C., Blaschko, M., De Vos, M.: Improving multimodal learning with multi-loss gradient modulation. arXiv preprint arXiv:2405.07930 (2024)

work page arXiv 2024
[8]

Ad- vances in Neural Information Processing Systems35, 17612–17625 (2022)

Liang, V.W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.Y.: Mind the gap: Under- standing the modality gap in multi-modal contrastive representation learning. Ad- vances in Neural Information Processing Systems35, 17612–17625 (2022)

work page 2022
[9]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

work page 2023
[10]

Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., Zhang, L., Gao, J., Li, C.: Llava-plus: Learning to use tools for creating multimodal agents (2023),https://arxiv.org/abs/2311.05437

work page arXiv 2023
[11]

lmms-lab: Llava-recap-cc12m (2024),https://huggingface.co/datasets/lmms- lab/LLaVA-ReCap-CC12M, hugging Face dataset, accessed 2026-03-05

work page 2024
[12]

lmms-lab: Llava-recap-cc3m (2024),https://huggingface.co/datasets/lmms- lab/LLaVA-ReCap-CC3M, hugging Face dataset, accessed 2026-03-05

work page 2024
[13]

Decoupled Weight Decay Regularization

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[14]

arXiv preprint arXiv:2508.13898 (2025)

Lu, Y., Armour, W.: Beyond the mean: Fisher-orthogonal projection for natural gradient descent in large batch training. arXiv preprint arXiv:2508.13898 (2025)

work page arXiv 2025
[15]

In: International conference on machine learning

Martens, J., Grosse, R.: Optimizing neural networks with kronecker-factored ap- proximate curvature. In: International conference on machine learning. pp. 2408–

work page
[16]

van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning (2018),https://arxiv.org/abs/1711.00937

work page internal anchor Pith review Pith/arXiv arXiv 2018
[17]

Revisiting Natural Gradient for Deep Networks

Pascanu, R., Bengio, Y.: Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[18]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Peng, X., Wei, Y., Deng, A., Wang, D., Hu, D.: Balanced multimodal learning via on-the-fly gradient modulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8238–8247 (2022)

work page 2022
[19]

arXiv preprint arXiv:2503.06456 (2025)

Qian, C., Han, K., Liu, J., Yuan, Z., Zhu, Z., Wang, J., Lyu, C., Chen, J., Liu, Z.: Dyncim: Dynamic curriculum for imbalanced multimodal learning. arXiv preprint arXiv:2503.06456 (2025)

work page arXiv 2025
[20]

arXiv preprint arXiv:2309.06497 (2023)

Shi, H.J.M., Lee, T.H., Iwasaki, S., Gallego-Posada, J., Li, Z., Rangadurai, K., Mudigere, D., Rabbat, M.: A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497 (2023)

work page arXiv 2023
[21]

Team,C.:Chameleon:Mixed-modalearly-fusionfoundationmodels.arXivpreprint arXiv:2405.09818 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

SOAP: Improving and Stabilizing Shampoo using Adam

Vyas, N., Morwani, D., Zhao, R., Kwun, M., Shapira, I., Brandfonbrener, D., Jan- son, L., Kakade, S.: Soap: Improving and stabilizing shampoo using adam. arXiv preprint arXiv:2409.11321 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Nature pp

Wang, X., Cui, Y., Wang, J., Zhang, F., Wang, Y., Zhang, X., Luo, Z., Sun, Q., Li, Z., Wang, Y., et al.: Multimodal learning with next-token prediction for large multimodal models. Nature pp. 1–7 (2026)

work page 2026
[24]

Advances in neural information processing systems30(2017)

Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. Advances in neural information processing systems30(2017)

work page 2017
[25]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12966–12977 (2025) 16 Yishun Lu and Wes Armour

work page 2025
[26]

Advances in neural information processing systems33, 5824–5836 (2020)

Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C.: Gradient surgery for multi-task learning. Advances in neural information processing systems33, 5824–5836 (2020)

work page 2020
[27]

Sigmoid Loss for Language Image Pre-Training

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training (2023),https://arxiv.org/abs/2303.15343 Abbreviated paper title 17 Supplementary Material for Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models A Theoretical Analysis of Fisher Information for Modality Balancing In this a...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Abbreviated paper title 19 A.3 Why CanF −1 Mitigate Modality Competition? The key observation follows from the information matrix equality (also known as Bartlett’s identity)

(29) This shows that, locally, the Fisher information matrix defines the intrinsic metric of the statistical manifold. Abbreviated paper title 19 A.3 Why CanF −1 Mitigate Modality Competition? The key observation follows from the information matrix equality (also known as Bartlett’s identity). For log-likelihood objectives, F=−E pθ ∇2 logp(z|θ) =E pθ ∇log...

work page

[1] [1]

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Abreu, N., Vyas, N., Kakade, S., Morwani, D.: The potential of second-order opti- mization for llms: A study with full gauss-newton. arXiv preprint arXiv:2510.09378 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond (2023),https://arxiv.org/abs/2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

In: International conference on machine learning

Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: Gradnorm: Gradient nor- malization for adaptive loss balancing in deep multitask networks. In: International conference on machine learning. pp. 794–803. PMLR (2018)

work page 2018

[4] [4]

Advances in neural information processing systems36, 49250–49267 (2023)

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)

work page 2023

[5] [5]

Acta numerica24, 259–328 (2015)

Giles, M.B.: Multilevel monte carlo methods. Acta numerica24, 259–328 (2015)

work page 2015

[6] [6]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J.W., Vinyals, O., Sifre, L.: Training compute-optimal large language models (2022),https:/...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[7] [7]

arXiv preprint arXiv:2405.07930 (2024)

Kontras, K., Chatzichristos, C., Blaschko, M., De Vos, M.: Improving multimodal learning with multi-loss gradient modulation. arXiv preprint arXiv:2405.07930 (2024)

work page arXiv 2024

[8] [8]

Ad- vances in Neural Information Processing Systems35, 17612–17625 (2022)

Liang, V.W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.Y.: Mind the gap: Under- standing the modality gap in multi-modal contrastive representation learning. Ad- vances in Neural Information Processing Systems35, 17612–17625 (2022)

work page 2022

[9] [9]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

work page 2023

[10] [10]

Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., Zhang, L., Gao, J., Li, C.: Llava-plus: Learning to use tools for creating multimodal agents (2023),https://arxiv.org/abs/2311.05437

work page arXiv 2023

[11] [11]

lmms-lab: Llava-recap-cc12m (2024),https://huggingface.co/datasets/lmms- lab/LLaVA-ReCap-CC12M, hugging Face dataset, accessed 2026-03-05

work page 2024

[12] [12]

lmms-lab: Llava-recap-cc3m (2024),https://huggingface.co/datasets/lmms- lab/LLaVA-ReCap-CC3M, hugging Face dataset, accessed 2026-03-05

work page 2024

[13] [13]

Decoupled Weight Decay Regularization

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[14] [14]

arXiv preprint arXiv:2508.13898 (2025)

Lu, Y., Armour, W.: Beyond the mean: Fisher-orthogonal projection for natural gradient descent in large batch training. arXiv preprint arXiv:2508.13898 (2025)

work page arXiv 2025

[15] [15]

In: International conference on machine learning

Martens, J., Grosse, R.: Optimizing neural networks with kronecker-factored ap- proximate curvature. In: International conference on machine learning. pp. 2408–

work page

[16] [16]

van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning (2018),https://arxiv.org/abs/1711.00937

work page internal anchor Pith review Pith/arXiv arXiv 2018

[17] [17]

Revisiting Natural Gradient for Deep Networks

Pascanu, R., Bengio, Y.: Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013

[18] [18]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Peng, X., Wei, Y., Deng, A., Wang, D., Hu, D.: Balanced multimodal learning via on-the-fly gradient modulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8238–8247 (2022)

work page 2022

[19] [19]

arXiv preprint arXiv:2503.06456 (2025)

Qian, C., Han, K., Liu, J., Yuan, Z., Zhu, Z., Wang, J., Lyu, C., Chen, J., Liu, Z.: Dyncim: Dynamic curriculum for imbalanced multimodal learning. arXiv preprint arXiv:2503.06456 (2025)

work page arXiv 2025

[20] [20]

arXiv preprint arXiv:2309.06497 (2023)

Shi, H.J.M., Lee, T.H., Iwasaki, S., Gallego-Posada, J., Li, Z., Rangadurai, K., Mudigere, D., Rabbat, M.: A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497 (2023)

work page arXiv 2023

[21] [21]

Team,C.:Chameleon:Mixed-modalearly-fusionfoundationmodels.arXivpreprint arXiv:2405.09818 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

SOAP: Improving and Stabilizing Shampoo using Adam

Vyas, N., Morwani, D., Zhao, R., Kwun, M., Shapira, I., Brandfonbrener, D., Jan- son, L., Kakade, S.: Soap: Improving and stabilizing shampoo using adam. arXiv preprint arXiv:2409.11321 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Nature pp

Wang, X., Cui, Y., Wang, J., Zhang, F., Wang, Y., Zhang, X., Luo, Z., Sun, Q., Li, Z., Wang, Y., et al.: Multimodal learning with next-token prediction for large multimodal models. Nature pp. 1–7 (2026)

work page 2026

[24] [24]

Advances in neural information processing systems30(2017)

Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. Advances in neural information processing systems30(2017)

work page 2017

[25] [25]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12966–12977 (2025) 16 Yishun Lu and Wes Armour

work page 2025

[26] [26]

Advances in neural information processing systems33, 5824–5836 (2020)

Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C.: Gradient surgery for multi-task learning. Advances in neural information processing systems33, 5824–5836 (2020)

work page 2020

[27] [27]

Sigmoid Loss for Language Image Pre-Training

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training (2023),https://arxiv.org/abs/2303.15343 Abbreviated paper title 17 Supplementary Material for Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models A Theoretical Analysis of Fisher Information for Modality Balancing In this a...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

Abbreviated paper title 19 A.3 Why CanF −1 Mitigate Modality Competition? The key observation follows from the information matrix equality (also known as Bartlett’s identity)

(29) This shows that, locally, the Fisher information matrix defines the intrinsic metric of the statistical manifold. Abbreviated paper title 19 A.3 Why CanF −1 Mitigate Modality Competition? The key observation follows from the information matrix equality (also known as Bartlett’s identity). For log-likelihood objectives, F=−E pθ ∇2 logp(z|θ) =E pθ ∇log...

work page