pith. machine review for the scientific record.

arxiv: 2605.07914 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.CV

Recognition: no theorem link

Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 03:15 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords flatness · gradient alignment · excess risk decomposition · domain generalization · multi-task learning · SAGE optimizer · Hessian · multi-distribution learning

The pith

Excess risk in multi-distribution learning decomposes into independent flatness and gradient-alignment terms that must both be controlled.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that sharpness-aware and gradient-alignment methods each miss one of two necessary geometric properties for good generalization in settings with multiple distributions. It derives an excess-risk decomposition with an alignment term given by tr(H̄⁻¹Σ_g), the trace of the inverse average Hessian times the cross-distribution gradient covariance, and a curvature term controlled by the average Hessian H̄. A counterexample shows that neither term bounds the other, so methods that optimize only one property cannot guarantee low risk. This motivates SAGE, which combines polar-factor perturbations to address curvature with disagreement-scaled noise for alignment, improving performance on domain-generalization and multi-task benchmarks.

Core claim

We show that both flatness and gradient alignment are necessary because the excess risk admits an additive decomposition into an alignment term controlled by tr(H̄⁻¹Σ_g), the trace of the inverse average Hessian times the cross-distribution gradient covariance, and a curvature term controlled by the average Hessian H̄, so the Hessian appears inverted in one term and non-inverted in the other. A counterexample demonstrates that neither quantity bounds the other, so single-property methods are insufficient. Motivated by this, SAGE replaces SAM's perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, and injects isotropic noise scaled by cross-distribution gradient disagreement.
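To make the mutual non-boundedness concrete, here is a small numerical illustration in the spirit of (but not identical to) the paper's counterexample; the specific 2×2 matrices are our own choice.

```python
import numpy as np

def terms(H_bar, Sigma_g):
    """Alignment term tr(H_bar^{-1} Sigma_g) and curvature proxy ||H_bar||_2."""
    alignment = np.trace(np.linalg.inv(H_bar) @ Sigma_g)
    curvature = np.linalg.norm(H_bar, 2)  # largest eigenvalue for these PSD matrices
    return alignment, curvature

# Case A: a very flat direction that carries cross-distribution disagreement
# -> curvature stays at 1.0 while the alignment term is ~1001.
print(terms(np.diag([1e-3, 1.0]), np.diag([1.0, 1.0])))

# Case B: a sharp direction with almost no gradient disagreement
# -> alignment term is 0.001 while curvature is 100.0.
print(terms(np.diag([100.0, 1.0]), np.diag([0.0, 1e-3])))
```

Driving only the curvature down (Case A) or only the alignment term down (Case B) leaves the other quantity arbitrarily large, which is the structural point behind the necessity claim.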

What carries the argument

The excess-risk decomposition separating an alignment term, tr(H̄⁻¹Σ_g), from a curvature term controlled by the average Hessian H̄, together with the SAGE optimizer, which uses polar-factor perturbations for curvature and disagreement-scaled isotropic noise for alignment.
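A minimal sketch of those two mechanisms for a single weight matrix, in plain numpy; the function names, the five-step iteration count, the 1.5/−0.5 coefficients, and the way the noise is folded into the descent step are our assumptions, since this summary does not fix the exact update rule.

```python
import numpy as np

def newton_schulz_polar(G, steps=5, eps=1e-7):
    """Approximate the polar factor U V^T of G with Newton-Schulz iterations.
    Dividing by the Frobenius norm puts the nonzero singular values in (0, 1],
    where X <- 1.5 X - 0.5 X X^T X drives them toward 1."""
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def sage_like_step(W, dist_grads, grad_at, rho=0.05, gamma=0.01, lr=0.1,
                   rng=np.random.default_rng(0)):
    """One illustrative SAGE-style update for one layer (hypothetical helper).

    dist_grads: list of per-distribution gradients at W.
    grad_at:    callable returning the aggregate gradient at a given point.
    """
    g_bar = np.mean(dist_grads, axis=0)

    # Curvature component: ascend along the polar factor of the gradient,
    # scaled by the layer norm, so every direction is probed with similar magnitude.
    eps_ascent = rho * np.linalg.norm(W) * newton_schulz_polar(g_bar)
    g_perturbed = grad_at(W + eps_ascent)

    # Alignment component: isotropic noise whose scale tracks how much the
    # per-distribution gradients disagree with their average.
    disagreement = np.mean([np.linalg.norm(g - g_bar) for g in dist_grads])
    noise = gamma * disagreement * rng.standard_normal(W.shape)

    # Descent step on the perturbed gradient plus the scaled noise.
    return W - lr * (g_perturbed + noise)
```

The paper applies this per layer on top of a base optimizer; the sketch collapses that to a single plain-gradient step.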

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future optimizers for federated or multi-source learning could incorporate joint monitoring of average Hessian and gradient covariance to diagnose generalization issues.
  • The decomposition suggests testing whether similar independent terms appear when distributions differ by label shift rather than domain shift.
  • Hybrid training procedures that alternate between spectral perturbation steps and noise injection steps may generalize the SAGE idea to other base optimizers.

Load-bearing premise

The excess risk admits a leading-order additive decomposition into the stated alignment and curvature terms under smoothness and distribution-shift conditions, with higher-order terms negligible.
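Schematically, and with the constants and the exact curvature functional left unspecified because they are not given in this summary, the premise amounts to an expansion of the form:

```latex
% Schematic rendering of the assumed leading-order decomposition; c_1, c_2
% and the curvature functional \phi are placeholders, R is the remainder.
\mathrm{ExcessRisk}(\theta)
  \;\approx\;
  \underbrace{c_1 \operatorname{tr}\!\left(\bar{H}^{-1}\Sigma_g\right)}_{\text{alignment}}
  \;+\;
  \underbrace{c_2\,\phi\!\left(\bar{H}\right)}_{\text{curvature}}
  \;+\; R,
\qquad
\bar{H} = \frac{1}{m}\sum_{i=1}^{m} H_i,
\quad
\Sigma_g = \operatorname{Cov}_{i}\!\left(\nabla_\theta L_i(\theta)\right).
```

The load-bearing assumption is precisely that R is negligible relative to the two displayed terms under the stated smoothness and distribution-shift conditions.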

What would settle it

An experiment showing that an optimizer targeting only flatness or only alignment achieves excess risk as low as SAGE on the DomainBed and multi-task benchmarks would falsify the necessity of addressing both terms.

Figures

Figures reproduced from arXiv: 2605.07914 by Aristotelis Ballas, Christos Diou.

Figure 1: Comparison of standard SAM and our proposed method. (a) Standard SAM computes the perturbation as ε_SAM = ρ ∇_θL/‖∇_θL‖₂, determined solely by the gradient direction (dashed ellipse), and the descent step lands at w_SAM (⋆). (b) We replace the gradient with its orthogonal polar factor (G → UVᵀ), setting all singular values to one and scaling by the layer norm, yielding a perturbation ε′ = ρ‖W‖_F · UVᵀ that …

Figure 2: Left: Aggregate loss contours with optimization trajectories from a shared starting point (⋆). ERM (grey), SAM (red), and SGLD (orange) converge to Minimum A, whereas only gradient-aligned noise (blue, ours) escapes to Minimum B. Right: The same trajectories overlaid on the gradient agreement landscape S(θ), where blue and red indicate high and low cross-domain gradient agreement, respectively. When S(θ) ≈ …

Figure 3: (left) reports OfficeHome accuracy over selections of the perturbation radius ρ and the noise scale γ. Performance is stable across selections, with the exception of either relatively low or high values. In the same figure, on the right, we compare the performance of SAGE when removing each of its components and against the Muon [28] optimizer. As apparent from the results, each component of …

Figure 4: Empirical scale invariance. Measured sharpness as a function of weight rescaling factor α for a two-layer MLP (no normalization layers). The network computes the same function at all α, so the true sharpness (green, dotted) is constant. Both SAM (red) and ASAM (orange, dashed) produce scale-dependent sharpness estimates that diverge away from α = 1, despite the underlying function being identical. SAGE (blu…

Figure 5: Gradient alignment for unseen domains during model training on the datasets of DomainBed.
Original abstract

Sharpness-aware and gradient-alignment methods have been shown to improve generalization, however each family of methods targets a single geometric property of the loss landscape, while ignoring the other. In this paper, we show that this omission is structurally unavoidable and that both flatness and gradient alignment should be considered in multi-distribution learning settings. Specifically, we derive an excess-risk decomposition that yields two additive leading-order terms: (i) an alignment term, controlled by the trace of $\bar{H}^{-1}\Sigma_g$ and (ii) a curvature term, controlled by $\bar{H}$, where $\bar{H}$ is the average Hessian and $\Sigma_g$ is the covariance of the gradient across distributions. Notably, $\bar{H}$ appears inverted in one and non-inverted in the other. We further show, via a counterexample, that neither quantity bounds the other in general, so no algorithm targeting only one term can guarantee low excess risk. Motivated by this decomposition, we propose SAGE (Spectral-Aware Gradient-Aligned Exploration) that targets both terms. The curvature component replaces SAM's gradient-scaled perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, so that the ascent step probes all directions with similar magnitude. On the other hand, the alignment component injects isotropic noise at the descent step, the magnitude of which scales with cross-distribution gradient disagreement. Experiments on five domain-generalization and two multi-task learning benchmarks show that the proposed method establishes a new state-of-the-art on DomainBed and acts as a general-purpose improvement to base MTL solvers, remaining competitive with, or even surpassing, state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that flatness and gradient alignment are both necessary in multi-distribution learning. It derives an excess-risk decomposition yielding two additive leading-order terms—an alignment term controlled by tr(H̄^{-1} Σ_g) and a curvature term controlled by H̄—where H̄ is the average Hessian and Σ_g the cross-distribution gradient covariance. A counterexample shows neither quantity bounds the other in general. Motivated by this, it proposes SAGE, which replaces SAM-style perturbations with the polar factor of each layer's gradient matrix (via Newton-Schulz) for curvature and injects isotropic noise scaled by gradient disagreement for alignment. Experiments report new SOTA on DomainBed and competitive gains on multi-task learning benchmarks.

Significance. If the decomposition is rigorously justified, the work supplies a structural explanation for why single-aspect regularizers are provably insufficient under distribution shift and motivates combined spectral-alignment methods. The counterexample is a useful negative result, the Newton-Schulz implementation is practical, and the reported benchmark improvements (DomainBed, MTL) indicate empirical relevance. These elements would strengthen the case for hybrid geometric regularizers if the leading-order claim holds.

major comments (1)
  1. [excess-risk decomposition] The excess-risk decomposition (abstract and theoretical analysis) states that the two terms dominate under smoothness and bounded-shift conditions, yet provides no quantitative remainder bounds (e.g., third-derivative Lipschitz constants or explicit controls on ||H_i − H̄||). This assumption is load-bearing for the central claim that the alignment and curvature terms are the leading contributions and that neither can be omitted.
minor comments (3)
  1. [experiments] Experimental tables and figures lack error bars, multiple random seeds, or statistical significance tests, making it difficult to assess whether reported gains are reliable.
  2. [theoretical analysis] The manuscript would benefit from an explicit statement of all assumptions (smoothness constants, bounded gradient covariance, etc.) and a self-contained proof sketch or appendix with the full Taylor-expansion steps.
  3. [preliminaries] Notation for H̄ and Σ_g should be defined at first use with a clear statement of how they are estimated from finite samples across the m distributions (one plausible estimation sketch follows this list).
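For reference, one plausible way to estimate these quantities from finite samples, using Hessian-vector products rather than explicit Hessians; this reflects our reading of the notation, not a procedure the paper prescribes.

```python
import numpy as np

def estimate_geometry(dist_grads, hvp_fns, probes):
    """Illustrative finite-sample estimates across m distributions.

    dist_grads: (m, d) array, one mean gradient per distribution.
    hvp_fns:    list of m callables v -> H_i @ v (Hessian-vector products).
    probes:     (k, d) array of Rademacher/Gaussian probes for a
                Hutchinson-style trace estimate.
    """
    g_bar = dist_grads.mean(axis=0)

    # Cross-distribution gradient covariance Sigma_g.
    centered = dist_grads - g_bar
    sigma_g = centered.T @ centered / (len(dist_grads) - 1)

    # Average Hessian acting on a vector: H_bar v = (1/m) sum_i H_i v.
    def h_bar_v(v):
        return np.mean([hvp(v) for hvp in hvp_fns], axis=0)

    # Hutchinson estimate of tr(H_bar) as a scalar curvature proxy.
    trace_h_bar = float(np.mean([v @ h_bar_v(v) for v in probes]))
    return sigma_g, trace_h_bar
```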

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the structural insight provided by the excess-risk decomposition and the counterexample. We address the single major comment below.

Point-by-point responses
  1. Referee: [excess-risk decomposition] The excess-risk decomposition (abstract and theoretical analysis) states that the two terms dominate under smoothness and bounded-shift conditions, yet provides no quantitative remainder bounds (e.g., third-derivative Lipschitz constants or explicit controls on ||H_i − H̄||). This assumption is load-bearing for the central claim that the alignment and curvature terms are the leading contributions and that neither can be omitted.

    Authors: We agree that the absence of explicit quantitative remainder bounds leaves the 'leading-order' claim less precise than it could be. The current derivation relies on L-smoothness together with a bounded-shift assumption that keeps the per-distribution Hessians close to their average; under these conditions the higher-order terms are o(1) as the perturbation radius tends to zero, but the paper does not supply explicit constants (e.g., a third-derivative Lipschitz constant or a uniform bound on ||H_i − H̄||). In the revised manuscript we will add an appendix subsection that (i) states the additional assumption of thrice continuously differentiable losses with bounded third derivatives, (ii) derives the explicit O(δ³) remainder where δ is the maximum perturbation size, and (iii) makes the bounded-shift condition quantitative by requiring ||H_i − H̄|| ≤ ε for a small ε that can be absorbed into the leading terms. These additions will be placed after the main decomposition and will not alter the paper's central claims or experimental results. revision: yes
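For concreteness, the promised remainder control would follow the standard third-order Taylor bound; in our rendering, with M a Hessian-Lipschitz (third-derivative) constant and δ the maximum perturbation size:

```latex
% Standard remainder bound under an M-Lipschitz Hessian; our rendering,
% not a statement taken from the paper.
\left| L(\theta + \delta v) - L(\theta) - \delta\,\nabla L(\theta)^{\top} v
       - \tfrac{\delta^{2}}{2}\, v^{\top} H(\theta)\, v \right|
  \;\le\; \frac{M}{6}\,\delta^{3}\,\|v\|^{3},
\qquad \|v\| = 1 \;\Rightarrow\; R = O(\delta^{3}).
```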

Circularity Check

0 steps flagged

No significant circularity: excess-risk decomposition derived from assumptions

Full rationale

The paper presents a derivation of an excess-risk decomposition under stated smoothness and distribution-shift conditions, producing two additive leading-order terms controlled by the average Hessian and gradient covariance. These quantities arise as outputs of the expansion rather than being presupposed by definition or obtained via fitting to the target result. No load-bearing self-citations, self-definitional steps, or renamings of known results are indicated in the abstract or reader's summary. The central claim is therefore self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the leading-order excess-risk decomposition and the data-dependent definitions of average Hessian and gradient covariance; no free parameters or new entities are introduced beyond the standard geometric quantities.

axioms (1)
  • domain assumption: Excess risk admits an additive leading-order decomposition into an alignment term (the trace of the inverse average Hessian times the gradient covariance) and a curvature term (controlled by the average Hessian).
    This decomposition is the load-bearing step that produces the two independent terms and the necessity claim.

pith-pipeline@v0.9.0 · 5613 in / 1394 out tokens · 53518 ms · 2026-05-11T03:15:26.886988+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 2 internal anchors

  1. [1] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31, 2018.
  2. [2] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. Advances in Neural Information Processing Systems, 32, 2019.
  3. [3] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
  4. [4] Akshay Rangamani et al. Loss landscapes and generalization in neural networks: Theory and applications. PhD thesis, Johns Hopkins University, 2020.
  5. [5] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  6. [6] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178, 2019.
  7. [7] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
  8. [8] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pages 5905–5914. PMLR, 2021.
  9. [9] Minyoung Kim, Da Li, Shell X Hu, and Timothy Hospedales. Fisher sam: Information geometry and sharpness aware minimisation. In International Conference on Machine Learning, pages 11148–11161. PMLR, 2022.
  10. [10] Ruipeng Zhang, Ziqing Fan, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Domain-inspired sharpness aware minimization under domain shifts. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=I4wB3HA3dJ.
  11. [11] Hao Ban, Gokul Ram Subramani, and Kaiyi Ji. Samo: A lightweight sharpness-aware approach for multi-task optimization with joint global-local perturbation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 785–795, 2025.
  12. [12] Lucas Mansilla, Rodrigo Echeveste, Diego H Milone, and Enzo Ferrante. Domain generalization via gradient surgery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6630–6638, 2021.
  13. [13] Yuge Shi, Jeffrey Seely, Philip Torr, Siddharth N, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vDwBW49HmO.
  14. [14] Binh M Le and Simon S Woo. Gradient alignment for cross-domain face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 188–199, 2024.
  15. [15] Aristotelis Ballas and Christos Diou. Gradient-guided annealing for domain generalization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 20558–20568, 2025.
  16. [16] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
  17. [17] Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878–18890, 2021.
  18. [18] Dmitry Senushkin, Nikolay Patakin, Arseny Kuznetsov, and Anton Konushin. Independent component alignment for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20083–20093, June 2023.
  19. [19] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and S Yu Philip. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 35(8):8052–8072, 2022.
  20. [20] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain Generalization: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4396–4415, April 2023. doi: 10.1109/TPAMI.2022.3195549.
  21. [21] Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2022. doi: 10.1109/TKDE.2021.3070203.
  22. [22] Bingcong Li and Georgios Giannakis. Enhancing sharpness-aware optimization through variance suppression. Advances in Neural Information Processing Systems, 36:70861–70879, 2023.
  23. [23] Yilang Zhang, Bingcong Li, and Georgios B Giannakis. Preconditioned sharpness-aware minimization: Unifying analysis and a novel learning algorithm. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.
  24. [24] Pengfei Wang, Zhaoxiang Zhang, Zhen Lei, and Lei Zhang. Sharpness-aware gradient matching for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3769–3778, 2023.
  25. [25] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
  26. [26] Xingxuan Zhang, Renzhe Xu, Han Yu, Hao Zou, and Peng Cui. Gradient norm aware minimization seeks first-order flatness and improves generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20247–20257, 2023.
  27. [27] Dahun Shin, Dongyeop Lee, Jinseok Chung, and Namhoon Lee. Sassha: Sharpness-aware adaptive second-order optimization with stable hessian approximation. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=7bgqx5OoVe.
  28. [28] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/.
  29. [29] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
  30. [30] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  31. [31] Aad W Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
  32. [32] Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha C Dvornek, Sekhar Tatikonda, James S Duncan, and Ting Liu. Surrogate gap minimization improves sharpness-aware training. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=edONMAnhLu-.
  33. [33] Kaiyue Wen, Tengyu Ma, and Zhiyuan Li. How does sharpness-aware minimization minimize sharpness?, 2023. URL https://arxiv.org/abs/2211.05729.
  34. [34] Chun-Hua Guo and Nicholas J Higham. A Schur–Newton method for the matrix pth root and its inverse. SIAM Journal on Matrix Analysis and Applications, 28(3):788–804, 2006.
  35. [35] Åke Björck and Clazett Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 8(2):358–364, 1971.
  36. [36] Zdislav Kovarik. Some iterative methods for improving orthonormality. SIAM Journal on Numerical Analysis, 7(3):386–389, 1970.
  37. [37] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017.
  38. [38] Aristotelis Ballas and Christos Diou. Towards domain generalization for ECG and EEG classification: Algorithms and benchmarks. IEEE Transactions on Emerging Topics in Computational Intelligence, 8(1):44–54, 2024. doi: 10.1109/TETCI.2023.3306253.
  39. [39] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  40. [40] Yee Whye Teh, Alexandre Thiéry, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(7), 2016.
  41. [41] Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405–22418, 2021.
  42. [42] Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game. In International Conference on Machine Learning, pages 16428–16446. PMLR, 2022.
  43. [43] Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu. Famo: Fast adaptive multitask optimization. Advances in Neural Information Processing Systems, 36:57226–57243, 2023.
  44. [44] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In International Conference on Learning Representations, 2021.
  45. [45] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
  46. [46] Chen Fang, Ye Xu, and Daniel N Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proceedings of the IEEE International Conference on Computer Vision, pages 1657–1664, 2013.
  47. [47] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
  48. [48] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pages 456–473, 2018.
  49. [49] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
  50. [50] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
  51. [51] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
  52. [52] Aodi Li, Liansheng Zhuang, Xiao Long, Houqiang Li, and Shafei Wang. Exploring mode connectivity in Krylov subspace for domain generalization. In The Fourteenth International Conference on Learning Representations, 2026.
  53. [53] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  54. [54] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  55. [55] Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024.
  56. [56] Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1871–1880, 2019.
  57. [57] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
  58. [58] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems, 31, 2018.
  59. [59] V Vapnik. Statistical Learning Theory. NY: Wiley, 1998.
  60. [60] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario March, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
  61. [61] Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain generalization via conditional invariant representations. In AAAI Conference on Artificial Intelligence, volume 32, 2018.
  62. [62] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Computer Vision and Pattern Recognition, 2018.
  63. [63] Zeyi Huang, Haohan Wang, Eric P Xing, and Dong Huang. Self-challenging improves cross-domain generalization. European Conference on Computer Vision, 2020.
  64. [64] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, 2016.
  65. [65] Bo Li, Yifei Shen, Yezhen Wang, Wenzhen Zhu, Dongsheng Li, Kurt Keutzer, and Han Zhao. Invariant information bottleneck for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7399–7407, 2022.
  66. [66] Baijiong Lin, Feiyang Ye, Yu Zhang, and Ivor W Tsang. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. arXiv preprint arXiv:2111.10603, 2021.
  67. [67] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
  68. [68] Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. Advances in Neural Information Processing Systems, 33:2039–2050, 2020.
  69. [69] Liyang Liu, Yi Li, Zhanghui Kuang, Jing-Hao Xue, Yimin Chen, Wenming Yang, Qingmin Liao, and Wayne Zhang. Towards impartial multi-task learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=IMPnRXEWpvr.
  70. [70] Heshan Devaka Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent approach. In The Eleventh International Conference on Learning Representations, 2023.
  71. [71] Hoang Phan, Lam Tran, Quyen Tran, Ngoc Tran, Tuan Truong, Qi Lei, Nhat Ho, Dinh Phung, and Trung Le. Beyond losses reweighting: Empowering multi-task learning via the generalization perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2440–2450, 2025.
  72. [72] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (rex). In International Conference on Machine Learning, pages 5815–5826. PMLR, 2021.
  73. [73] Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In International Conference on Machine Learning, pages 18347–18377. PMLR, 2022.
  74. [74] Jean-Antoine Désidéri. Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus. Mathématique, 350(5-6):313–318, 2012.