pith. machine review for the scientific record.

arxiv: 2605.07914 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.CV

Recognition: no theorem link

Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 03:15 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords flatness · gradient alignment · excess risk decomposition · domain generalization · multi-task learning · SAGE optimizer · Hessian · multi-distribution learning

The pith

Excess risk in multi-distribution learning decomposes into independent flatness and gradient-alignment terms that must both be controlled.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that sharpness-aware and gradient-alignment methods each miss one of two necessary geometric properties for good generalization in settings with multiple distributions. It derives an excess-risk decomposition with an alignment term given by tr(H̄⁻¹Σ_g), the trace of the inverse average Hessian times the cross-distribution gradient covariance, and a curvature term controlled by the average Hessian H̄. A counterexample shows that neither term bounds the other, so methods that optimize only one property cannot guarantee low risk. This motivates SAGE, which combines polar-factor perturbations to address curvature with disagreement-scaled noise for alignment, improving performance on domain-generalization and multi-task benchmarks.

Core claim

We show that both flatness and gradient alignment are necessary because the excess risk admits an additive decomposition into an alignment term controlled by tr(H̄⁻¹Σ_g), the trace of the inverse average Hessian times the cross-distribution gradient covariance, and a curvature term controlled by the average Hessian H̄, so the Hessian appears inverted in one term and non-inverted in the other. A counterexample demonstrates that neither quantity bounds the other, so single-property methods are insufficient. Motivated by this, SAGE replaces SAM's perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, and injects isotropic noise scaled by cross-distribution gradient disagreement.
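To make the mutual non-boundedness concrete, here is a small numerical illustration in the spirit of (but not identical to) the paper's counterexample; the specific 2×2 matrices are our own choice.

```python
import numpy as np

def terms(H_bar, Sigma_g):
    """Alignment term tr(H_bar^{-1} Sigma_g) and curvature proxy ||H_bar||_2."""
    alignment = np.trace(np.linalg.inv(H_bar) @ Sigma_g)
    curvature = np.linalg.norm(H_bar, 2)  # largest eigenvalue for these PSD matrices
    return alignment, curvature

# Case A: a very flat direction that carries cross-distribution disagreement
# -> curvature stays at 1.0 while the alignment term is ~1001.
print(terms(np.diag([1e-3, 1.0]), np.diag([1.0, 1.0])))

# Case B: a sharp direction with almost no gradient disagreement
# -> alignment term is 0.001 while curvature is 100.0.
print(terms(np.diag([100.0, 1.0]), np.diag([0.0, 1e-3])))
```

Driving only the curvature down (Case A) or only the alignment term down (Case B) leaves the other quantity arbitrarily large, which is the structural point behind the necessity claim.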

What carries the argument

The excess-risk decomposition separating an alignment term, tr(H̄⁻¹Σ_g), from a curvature term controlled by the average Hessian H̄, together with the SAGE optimizer, which uses polar-factor perturbations for curvature and disagreement-scaled isotropic noise for alignment.
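A minimal sketch of those two mechanisms for a single weight matrix, in plain numpy; the function names, the five-step iteration count, the 1.5/−0.5 coefficients, and the way the noise is folded into the descent step are our assumptions, since this summary does not fix the exact update rule.

```python
import numpy as np

def newton_schulz_polar(G, steps=5, eps=1e-7):
    """Approximate the polar factor U V^T of G with Newton-Schulz iterations.
    Dividing by the Frobenius norm puts the nonzero singular values in (0, 1],
    where X <- 1.5 X - 0.5 X X^T X drives them toward 1."""
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def sage_like_step(W, dist_grads, grad_at, rho=0.05, gamma=0.01, lr=0.1,
                   rng=np.random.default_rng(0)):
    """One illustrative SAGE-style update for one layer (hypothetical helper).

    dist_grads: list of per-distribution gradients at W.
    grad_at:    callable returning the aggregate gradient at a given point.
    """
    g_bar = np.mean(dist_grads, axis=0)

    # Curvature component: ascend along the polar factor of the gradient,
    # scaled by the layer norm, so every direction is probed with similar magnitude.
    eps_ascent = rho * np.linalg.norm(W) * newton_schulz_polar(g_bar)
    g_perturbed = grad_at(W + eps_ascent)

    # Alignment component: isotropic noise whose scale tracks how much the
    # per-distribution gradients disagree with their average.
    disagreement = np.mean([np.linalg.norm(g - g_bar) for g in dist_grads])
    noise = gamma * disagreement * rng.standard_normal(W.shape)

    # Descent step on the perturbed gradient plus the scaled noise.
    return W - lr * (g_perturbed + noise)
```

The paper applies this per layer on top of a base optimizer; the sketch collapses that to a single plain-gradient step.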

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future optimizers for federated or multi-source learning could incorporate joint monitoring of average Hessian and gradient covariance to diagnose generalization issues.
  • The decomposition suggests testing whether similar independent terms appear when distributions differ by label shift rather than domain shift.
  • Hybrid training procedures that alternate between spectral perturbation steps and noise injection steps may generalize the SAGE idea to other base optimizers.

Load-bearing premise

The excess risk admits a leading-order additive decomposition into the stated alignment and curvature terms under smoothness and distribution-shift conditions, with higher-order terms negligible.
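Schematically, and with the constants and the exact curvature functional left unspecified because they are not given in this summary, the premise amounts to an expansion of the form:

```latex
% Schematic rendering of the assumed leading-order decomposition; c_1, c_2
% and the curvature functional \phi are placeholders, R is the remainder.
\mathrm{ExcessRisk}(\theta)
  \;\approx\;
  \underbrace{c_1 \operatorname{tr}\!\left(\bar{H}^{-1}\Sigma_g\right)}_{\text{alignment}}
  \;+\;
  \underbrace{c_2\,\phi\!\left(\bar{H}\right)}_{\text{curvature}}
  \;+\; R,
\qquad
\bar{H} = \frac{1}{m}\sum_{i=1}^{m} H_i,
\quad
\Sigma_g = \operatorname{Cov}_{i}\!\left(\nabla_\theta L_i(\theta)\right).
```

The load-bearing assumption is precisely that R is negligible relative to the two displayed terms under the stated smoothness and distribution-shift conditions.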

What would settle it

An experiment showing that an optimizer targeting only flatness or only alignment achieves excess risk as low as SAGE on the DomainBed and multi-task benchmarks would falsify the necessity of addressing both terms.

Figures

Figures reproduced from arXiv: 2605.07914 by Aristotelis Ballas, Christos Diou.

Figure 1: Comparison of standard SAM and our proposed method. (a) Standard SAM computes the perturbation as ε_SAM = ρ ∇_θL/‖∇_θL‖₂, determined solely by the gradient direction (dashed ellipse), and the descent step lands at w_SAM (⋆). (b) We replace the gradient with its orthogonal polar factor (G → UVᵀ), setting all singular values to one and scaling by the layer norm, yielding a perturbation ε′ = ρ‖W‖_F · UVᵀ that …

Figure 2: Left: Aggregate loss contours with optimization trajectories from a shared starting point (⋆). ERM (grey), SAM (red), and SGLD (orange) converge to Minimum A, whereas only gradient-aligned noise (blue, ours) escapes to Minimum B. Right: The same trajectories overlaid on the gradient agreement landscape S(θ), where blue and red indicate high and low cross-domain gradient agreement, respectively. When S(θ) ≈ …

Figure 3: (left) reports OfficeHome accuracy over selections of the perturbation radius ρ and the noise scale γ. Performance is stable across selections, with the exception of either relatively low or high values. In the same figure, on the right, we compare the performance of SAGE when removing each of its components and against the Muon [28] optimizer. As apparent from the results, each component of …

Figure 4: Empirical scale invariance. Measured sharpness as a function of weight rescaling factor α for a two-layer MLP (no normalization layers). The network computes the same function at all α, so the true sharpness (green, dotted) is constant. Both SAM (red) and ASAM (orange, dashed) produce scale-dependent sharpness estimates that diverge away from α = 1, despite the underlying function being identical. SAGE (blu…

Figure 5: Gradient alignment for unseen domains during model training on the datasets of DomainBed.
Original abstract

Sharpness-aware and gradient-alignment methods have been shown to improve generalization, however each family of methods targets a single geometric property of the loss landscape, while ignoring the other. In this paper, we show that this omission is structurally unavoidable and that both flatness and gradient alignment should be considered in multi-distribution learning settings. Specifically, we derive an excess-risk decomposition that yields two additive leading-order terms: (i) an alignment term, controlled by the trace of $\bar{H}^{-1}\Sigma_g$ and (ii) a curvature term, controlled by $\bar{H}$, where $\bar{H}$ is the average Hessian and $\Sigma_g$ is the covariance of the gradient across distributions. Notably, $\bar{H}$ appears inverted in one and non-inverted in the other. We further show, via a counterexample, that neither quantity bounds the other in general, so no algorithm targeting only one term can guarantee low excess risk. Motivated by this decomposition, we propose SAGE (Spectral-Aware Gradient-Aligned Exploration) that targets both terms. The curvature component replaces SAM's gradient-scaled perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, so that the ascent step probes all directions with similar magnitude. On the other hand, the alignment component injects isotropic noise at the descent step, the magnitude of which scales with cross-distribution gradient disagreement. Experiments on five domain-generalization and two multi-task learning benchmarks show that the proposed method establishes a new state-of-the-art on DomainBed and acts as a general-purpose improvement to base MTL solvers, remaining competitive with, or even surpassing, state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that flatness and gradient alignment are both necessary in multi-distribution learning. It derives an excess-risk decomposition yielding two additive leading-order terms—an alignment term controlled by tr(H̄^{-1} Σ_g) and a curvature term controlled by H̄—where H̄ is the average Hessian and Σ_g the cross-distribution gradient covariance. A counterexample shows neither quantity bounds the other in general. Motivated by this, it proposes SAGE, which replaces SAM-style perturbations with the polar factor of each layer's gradient matrix (via Newton-Schulz) for curvature and injects isotropic noise scaled by gradient disagreement for alignment. Experiments report new SOTA on DomainBed and competitive gains on multi-task learning benchmarks.

Significance. If the decomposition is rigorously justified, the work supplies a structural explanation for why single-aspect regularizers are provably insufficient under distribution shift and motivates combined spectral-alignment methods. The counterexample is a useful negative result, the Newton-Schulz implementation is practical, and the reported benchmark improvements (DomainBed, MTL) indicate empirical relevance. These elements would strengthen the case for hybrid geometric regularizers if the leading-order claim holds.

major comments (1)
  1. [excess-risk decomposition] The excess-risk decomposition (abstract and theoretical analysis) states that the two terms dominate under smoothness and bounded-shift conditions, yet provides no quantitative remainder bounds (e.g., third-derivative Lipschitz constants or explicit controls on ||H_i − H̄||). This assumption is load-bearing for the central claim that the alignment and curvature terms are the leading contributions and that neither can be omitted.
minor comments (3)
  1. [experiments] Experimental tables and figures lack error bars, multiple random seeds, or statistical significance tests, making it difficult to assess whether reported gains are reliable.
  2. [theoretical analysis] The manuscript would benefit from an explicit statement of all assumptions (smoothness constants, bounded gradient covariance, etc.) and a self-contained proof sketch or appendix with the full Taylor-expansion steps.
  3. [preliminaries] Notation for H̄ and Σ_g should be defined at first use with a clear statement of how they are estimated from finite samples across the m distributions (one plausible estimation sketch follows this list).
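For reference, one plausible way to estimate these quantities from finite samples, using Hessian-vector products rather than explicit Hessians; this reflects our reading of the notation, not a procedure the paper prescribes.

```python
import numpy as np

def estimate_geometry(dist_grads, hvp_fns, probes):
    """Illustrative finite-sample estimates across m distributions.

    dist_grads: (m, d) array, one mean gradient per distribution.
    hvp_fns:    list of m callables v -> H_i @ v (Hessian-vector products).
    probes:     (k, d) array of Rademacher/Gaussian probes for a
                Hutchinson-style trace estimate.
    """
    g_bar = dist_grads.mean(axis=0)

    # Cross-distribution gradient covariance Sigma_g.
    centered = dist_grads - g_bar
    sigma_g = centered.T @ centered / (len(dist_grads) - 1)

    # Average Hessian acting on a vector: H_bar v = (1/m) sum_i H_i v.
    def h_bar_v(v):
        return np.mean([hvp(v) for hvp in hvp_fns], axis=0)

    # Hutchinson estimate of tr(H_bar) as a scalar curvature proxy.
    trace_h_bar = float(np.mean([v @ h_bar_v(v) for v in probes]))
    return sigma_g, trace_h_bar
```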

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the structural insight provided by the excess-risk decomposition and the counterexample. We address the single major comment below.

Point-by-point responses
  1. Referee: [excess-risk decomposition] The excess-risk decomposition (abstract and theoretical analysis) states that the two terms dominate under smoothness and bounded-shift conditions, yet provides no quantitative remainder bounds (e.g., third-derivative Lipschitz constants or explicit controls on ||H_i − H̄||). This assumption is load-bearing for the central claim that the alignment and curvature terms are the leading contributions and that neither can be omitted.

    Authors: We agree that the absence of explicit quantitative remainder bounds leaves the 'leading-order' claim less precise than it could be. The current derivation relies on L-smoothness together with a bounded-shift assumption that keeps the per-distribution Hessians close to their average; under these conditions the higher-order terms are o(1) as the perturbation radius tends to zero, but the paper does not supply explicit constants (e.g., a third-derivative Lipschitz constant or a uniform bound on ||H_i − H̄||). In the revised manuscript we will add an appendix subsection that (i) states the additional assumption of thrice continuously differentiable losses with bounded third derivatives, (ii) derives the explicit O(δ³) remainder where δ is the maximum perturbation size, and (iii) makes the bounded-shift condition quantitative by requiring ||H_i − H̄|| ≤ ε for a small ε that can be absorbed into the leading terms. These additions will be placed after the main decomposition and will not alter the paper's central claims or experimental results. revision: yes
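For concreteness, the promised remainder control would follow the standard third-order Taylor bound; in our rendering, with M a Hessian-Lipschitz (third-derivative) constant and δ the maximum perturbation size:

```latex
% Standard remainder bound under an M-Lipschitz Hessian; our rendering,
% not a statement taken from the paper.
\left| L(\theta + \delta v) - L(\theta) - \delta\,\nabla L(\theta)^{\top} v
       - \tfrac{\delta^{2}}{2}\, v^{\top} H(\theta)\, v \right|
  \;\le\; \frac{M}{6}\,\delta^{3}\,\|v\|^{3},
\qquad \|v\| = 1 \;\Rightarrow\; R = O(\delta^{3}).
```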

Circularity Check

0 steps flagged

No significant circularity: excess-risk decomposition derived from assumptions

Full rationale

The paper presents a derivation of an excess-risk decomposition under stated smoothness and distribution-shift conditions, producing two additive leading-order terms controlled by the average Hessian and gradient covariance. These quantities arise as outputs of the expansion rather than being presupposed by definition or obtained via fitting to the target result. No load-bearing self-citations, self-definitional steps, or renamings of known results are indicated in the abstract or reader's summary. The central claim is therefore self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the leading-order excess-risk decomposition and the data-dependent definitions of average Hessian and gradient covariance; no free parameters or new entities are introduced beyond the standard geometric quantities.

axioms (1)
  • domain assumption: Excess risk admits an additive leading-order decomposition into an alignment term (the trace of the inverse average Hessian times the gradient covariance) and a curvature term (controlled by the average Hessian).
    This decomposition is the load-bearing step that produces the two independent terms and the necessity claim.

pith-pipeline@v0.9.0 · 5613 in / 1394 out tokens · 53518 ms · 2026-05-11T03:15:26.886988+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 2 internal anchors

  1. [1] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31, 2018.
  2. [2] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. Advances in Neural Information Processing Systems, 32, 2019.
  3. [3] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
  4. [4] Akshay Rangamani et al. Loss landscapes and generalization in neural networks: Theory and applications. PhD thesis, Johns Hopkins University, 2020.
  5. [5] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  6. [6] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178, 2019.
  7. [7] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
  8. [8] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pages 5905–5914. PMLR, 2021.
  9. [9] Minyoung Kim, Da Li, Shell X Hu, and Timothy Hospedales. Fisher sam: Information geometry and sharpness aware minimisation. In International Conference on Machine Learning, pages 11148–11161. PMLR, 2022.
  10. [10] Ruipeng Zhang, Ziqing Fan, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Domain-inspired sharpness aware minimization under domain shifts. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=I4wB3HA3dJ.
  11. [11] Hao Ban, Gokul Ram Subramani, and Kaiyi Ji. Samo: A lightweight sharpness-aware approach for multi-task optimization with joint global-local perturbation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 785–795, 2025.
  12. [12] Lucas Mansilla, Rodrigo Echeveste, Diego H Milone, and Enzo Ferrante. Domain generalization via gradient surgery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6630–6638, 2021.
  13. [13] Yuge Shi, Jeffrey Seely, Philip Torr, Siddharth N, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vDwBW49HmO.
  14. [14] Binh M Le and Simon S Woo. Gradient alignment for cross-domain face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 188–199, 2024.
  15. [15] Aristotelis Ballas and Christos Diou. Gradient-guided annealing for domain generalization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 20558–20568, 2025.
  16. [16] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
  17. [17] Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878–18890, 2021.
  18. [18] Dmitry Senushkin, Nikolay Patakin, Arseny Kuznetsov, and Anton Konushin. Independent component alignment for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20083–20093, June 2023.
  19. [19] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and S Yu Philip. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 35(8):8052–8072, 2022.
  20. [20] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain Generalization: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4396–4415, April 2023. doi: 10.1109/TPAMI.2022.3195549.
  21. [21] Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2022. doi: 10.1109/TKDE.2021.3070203.
  22. [22] Bingcong Li and Georgios Giannakis. Enhancing sharpness-aware optimization through variance suppression. Advances in Neural Information Processing Systems, 36:70861–70879, 2023.
  23. [23] Yilang Zhang, Bingcong Li, and Georgios B Giannakis. Preconditioned sharpness-aware minimization: Unifying analysis and a novel learning algorithm. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.
  24. [24] Pengfei Wang, Zhaoxiang Zhang, Zhen Lei, and Lei Zhang. Sharpness-aware gradient matching for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3769–3778, 2023.
  25. [25] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
  26. [26] Xingxuan Zhang, Renzhe Xu, Han Yu, Hao Zou, and Peng Cui. Gradient norm aware minimization seeks first-order flatness and improves generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20247–20257, 2023.
  27. [27] Dahun Shin, Dongyeop Lee, Jinseok Chung, and Namhoon Lee. Sassha: Sharpness-aware adaptive second-order optimization with stable hessian approximation. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=7bgqx5OoVe.
  28. [28] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/.
  29. [29] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
  30. [30] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  31. [31] Aad W Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
  32. [32] Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha C Dvornek, Sekhar Tatikonda, James S Duncan, and Ting Liu. Surrogate gap minimization improves sharpness-aware training. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=edONMAnhLu-.
  33. [33] Kaiyue Wen, Tengyu Ma, and Zhiyuan Li. How does sharpness-aware minimization minimize sharpness?, 2023. URL https://arxiv.org/abs/2211.05729.
  34. [34] Chun-Hua Guo and Nicholas J Higham. A Schur–Newton method for the matrix pth root and its inverse. SIAM Journal on Matrix Analysis and Applications, 28(3):788–804, 2006.
  35. [35] Åke Björck and Clazett Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 8(2):358–364, 1971.
  36. [36] Zdislav Kovarik. Some iterative methods for improving orthonormality. SIAM Journal on Numerical Analysis, 7(3):386–389, 1970.
  37. [37] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017.
  38. [38] Aristotelis Ballas and Christos Diou. Towards domain generalization for ECG and EEG classification: Algorithms and benchmarks. IEEE Transactions on Emerging Topics in Computational Intelligence, 8(1):44–54, 2024. doi: 10.1109/TETCI.2023.3306253.
  39. [39] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  40. [40] Yee Whye Teh, Alexandre Thiéry, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(7), 2016.
  41. [41] Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405–22418, 2021.
  42. [42] Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game. In International Conference on Machine Learning, pages 16428–16446. PMLR, 2022.
  43. [43] Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu. Famo: Fast adaptive multitask optimization. Advances in Neural Information Processing Systems, 36:57226–57243, 2023.
  44. [44] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In International Conference on Learning Representations, 2021.
  45. [45] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
  46. [46] Chen Fang, Ye Xu, and Daniel N Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proceedings of the IEEE International Conference on Computer Vision, pages 1657–1664, 2013.
  47. [47] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
  48. [48] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pages 456–473, 2018.
  49. [49] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
  50. [50] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
  51. [51] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
  52. [52] Aodi Li, Liansheng Zhuang, Xiao Long, Houqiang Li, and Shafei Wang. Exploring mode connectivity in Krylov subspace for domain generalization. In The Fourteenth International Conference on Learning Representations, 2026.
  53. [53] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  54. [54] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  55. [55] Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024.
  56. [56] Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1871–1880, 2019.
  57. [57] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
  58. [58] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems, 31, 2018.
  59. [59] V Vapnik. Statistical Learning Theory. NY: Wiley, 1998.
  60. [60] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario March, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
  61. [61] Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain generalization via conditional invariant representations. In AAAI Conference on Artificial Intelligence, volume 32, 2018.
  62. [62] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Computer Vision and Pattern Recognition, 2018.
  63. [63] Zeyi Huang, Haohan Wang, Eric P Xing, and Dong Huang. Self-challenging improves cross-domain generalization. European Conference on Computer Vision, 2020.
  64. [64] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, 2016.
  65. [65] Bo Li, Yifei Shen, Yezhen Wang, Wenzhen Zhu, Dongsheng Li, Kurt Keutzer, and Han Zhao. Invariant information bottleneck for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7399–7407, 2022.
  66. [66] Baijiong Lin, Feiyang Ye, Yu Zhang, and Ivor W Tsang. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. arXiv preprint arXiv:2111.10603, 2021.
  67. [67] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
  68. [68] Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. Advances in Neural Information Processing Systems, 33:2039–2050, 2020.
  69. [69] Liyang Liu, Yi Li, Zhanghui Kuang, Jing-Hao Xue, Yimin Chen, Wenming Yang, Qingmin Liao, and Wayne Zhang. Towards impartial multi-task learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=IMPnRXEWpvr.
  70. [70] Heshan Devaka Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent approach. In The Eleventh International Conference on Learning Representations, 2023.
  71. [71] Hoang Phan, Lam Tran, Quyen Tran, Ngoc Tran, Tuan Truong, Qi Lei, Nhat Ho, Dinh Phung, and Trung Le. Beyond losses reweighting: Empowering multi-task learning via the generalization perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2440–2450, 2025.
  72. [72] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (rex). In International Conference on Machine Learning, pages 5815–5826. PMLR, 2021.
  73. [73] Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In International Conference on Machine Learning, pages 18347–18377. PMLR, 2022.
  74. [74] Jean-Antoine Désidéri. Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus. Mathématique, 350(5-6):313–318, 2012.