Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
Pith reviewed 2026-05-11 03:15 UTC · model grok-4.3
The pith
Excess risk in multi-distribution learning decomposes into independent flatness and gradient-alignment terms that must both be controlled.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that both flatness and gradient alignment are necessary: the excess risk admits an additive leading-order decomposition into an alignment term, controlled by the trace of the inverse average Hessian times the cross-distribution gradient covariance, and a curvature term, controlled by the average Hessian, with the Hessian appearing inverted in one term and non-inverted in the other. A counterexample demonstrates that neither quantity bounds the other in general, so methods targeting a single property are insufficient. Motivated by this, SAGE replaces SAM's perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, and injects isotropic noise scaled by cross-distribution gradient disagreement.
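A schematic rendering of the claimed decomposition in the abstract's notation; the constants $c_1$ and $c_2$, the perturbation radius $\rho$, and the remainder $R$ are illustrative placeholders rather than quantities taken from the paper:

```latex
\[
\underbrace{\mathbb{E}\!\left[\mathcal{L}_{\mathrm{test}}(\theta)\right] - \mathcal{L}^{\ast}}_{\text{excess risk}}
\;\approx\;
\underbrace{c_1 \operatorname{tr}\!\left(\bar{H}^{-1}\Sigma_g\right)}_{\text{alignment term}}
\;+\;
\underbrace{c_2\,\rho^{2}\,\lambda_{\max}\!\left(\bar{H}\right)}_{\text{curvature term}}
\;+\; R,
\qquad
\bar{H} = \frac{1}{m}\sum_{i=1}^{m} H_i,
\quad
\Sigma_g = \operatorname{Cov}_i\!\left(g_i\right).
\]
```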
What carries the argument
The excess-risk decomposition separating an alignment term (trace of inverse average Hessian times gradient covariance) and a curvature term (average Hessian), together with the SAGE optimizer that uses polar-factor perturbations for curvature and disagreement-scaled isotropic noise for alignment.
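A minimal sketch of how such an update could look, assuming PyTorch and a single 2-D layer weight; the function names, the number of Newton-Schulz iterations, and the hyperparameters `rho`, `eta`, and `noise_coef` are illustrative assumptions, not the paper's implementation:

```python
import torch

def newton_schulz_polar(G: torch.Tensor, num_iters: int = 5) -> torch.Tensor:
    """Approximate the orthogonal polar factor of a 2-D matrix G.

    The cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X converges to
    the polar factor when the singular values of the starting iterate lie in
    (0, sqrt(3)); dividing by the Frobenius norm puts them in that range.
    """
    X = G / (G.norm() + 1e-12)
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def sage_like_step(param, per_dist_grads, loss_fn, rho=0.05, eta=0.01, noise_coef=1e-3):
    """One SAGE-style update for a single 2-D weight matrix (illustrative only).

    per_dist_grads: list of m gradients of the per-distribution losses w.r.t. param.
    loss_fn: closure that recomputes the averaged training loss at the current params.
    """
    g_bar = torch.stack(per_dist_grads).mean(dim=0)

    # Curvature component: SAM-style ascent along the polar factor of the
    # gradient matrix, so all directions are probed with comparable magnitude.
    perturbation = rho * newton_schulz_polar(g_bar)
    with torch.no_grad():
        param.add_(perturbation)
    loss = loss_fn()
    grad_at_perturbed = torch.autograd.grad(loss, param)[0]

    with torch.no_grad():
        param.sub_(perturbation)  # undo the ascent step

        # Alignment component: isotropic noise whose scale tracks how much the
        # per-distribution gradients disagree with their mean.
        disagreement = torch.stack([(g - g_bar).norm() for g in per_dist_grads]).mean()
        noise = noise_coef * disagreement * torch.randn_like(param)

        # Descent on the gradient taken at the perturbed point, plus the noise.
        param.sub_(eta * (grad_at_perturbed + noise))
```

The polar factor keeps the ascent direction's singular values near one, which is what makes the perturbation "spectral-aware" relative to SAM's gradient-scaled step.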
Where Pith is reading between the lines
- Future optimizers for federated or multi-source learning could incorporate joint monitoring of average Hessian and gradient covariance to diagnose generalization issues.
- The decomposition suggests testing whether similar independent terms appear when distributions differ by label shift rather than domain shift.
- Hybrid training procedures that alternate between spectral perturbation steps and noise injection steps may generalize the SAGE idea to other base optimizers.
Load-bearing premise
The excess risk admits a leading-order additive decomposition into the stated alignment and curvature terms under smoothness and distribution-shift conditions, with higher-order terms negligible.
What would settle it
An experiment showing that an optimizer targeting only flatness or only alignment achieves excess risk as low as SAGE on the DomainBed and multi-task benchmarks would falsify the necessity of addressing both terms.
Original abstract
Sharpness-aware and gradient-alignment methods have been shown to improve generalization; however, each family of methods targets a single geometric property of the loss landscape, while ignoring the other. In this paper, we show that this omission is structurally unavoidable and that both flatness and gradient alignment should be considered in multi-distribution learning settings. Specifically, we derive an excess-risk decomposition that yields two additive leading-order terms: (i) an alignment term, controlled by the trace of $\bar{H}^{-1}\Sigma_g$, and (ii) a curvature term, controlled by $\bar{H}$, where $\bar{H}$ is the average Hessian and $\Sigma_g$ is the covariance of the gradient across distributions. Notably, $\bar{H}$ appears inverted in one and non-inverted in the other. We further show, via a counterexample, that neither quantity bounds the other in general, so no algorithm targeting only one term can guarantee low excess risk. Motivated by this decomposition, we propose SAGE (Spectral-Aware Gradient-Aligned Exploration), which targets both terms. The curvature component replaces SAM's gradient-scaled perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, so that the ascent step probes all directions with similar magnitude. On the other hand, the alignment component injects isotropic noise at the descent step, the magnitude of which scales with cross-distribution gradient disagreement. Experiments on five domain-generalization and two multi-task learning benchmarks show that the proposed method establishes a new state-of-the-art on DomainBed and acts as a general-purpose improvement to base MTL solvers, remaining competitive with, or even surpassing, state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that flatness and gradient alignment are both necessary in multi-distribution learning. It derives an excess-risk decomposition yielding two additive leading-order terms: an alignment term controlled by tr(H̄^{-1} Σ_g) and a curvature term controlled by H̄, where H̄ is the average Hessian and Σ_g the cross-distribution gradient covariance. A counterexample shows neither quantity bounds the other in general. Motivated by this, it proposes SAGE, which replaces SAM-style perturbations with the polar factor of each layer's gradient matrix (via Newton-Schulz) for curvature and injects isotropic noise scaled by gradient disagreement for alignment. Experiments report new SOTA on DomainBed and competitive gains on multi-task learning benchmarks.
Significance. If the decomposition is rigorously justified, the work supplies a structural explanation for why single-aspect regularizers are provably insufficient under distribution shift and motivates combined spectral-alignment methods. The counterexample is a useful negative result, the Newton-Schulz implementation is practical, and the reported benchmark improvements (DomainBed, MTL) indicate empirical relevance. These elements would strengthen the case for hybrid geometric regularizers if the leading-order claim holds.
major comments (1)
- [excess-risk decomposition] The excess-risk decomposition (abstract and theoretical analysis) states that the two terms dominate under smoothness and bounded-shift conditions, yet provides no quantitative remainder bounds (e.g., third-derivative Lipschitz constants or explicit controls on ||H_i − H̄||). This assumption is load-bearing for the central claim that the alignment and curvature terms are the leading contributions and that neither can be omitted.
minor comments (3)
- [experiments] Experimental tables and figures lack error bars, multiple random seeds, or statistical significance tests, making it difficult to assess whether reported gains are reliable.
- [theoretical analysis] The manuscript would benefit from an explicit statement of all assumptions (smoothness constants, bounded gradient covariance, etc.) and a self-contained proof sketch or appendix with the full Taylor-expansion steps.
- [preliminaries] Notation for H̄ and Σ_g should be defined at first use, with a clear statement of how they are estimated from finite samples across the m distributions (one possible estimator is sketched below).
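As an illustration of how these quantities might be estimated from finite samples; the estimator below is a plausible sketch, not the paper's procedure, and the argument name `per_dist_grads` and the trace-based disagreement scalar are assumptions:

```python
import torch

def estimate_gradient_stats(per_dist_grads):
    """Estimate the mean gradient, the cross-distribution gradient covariance
    Sigma_g, and a scalar disagreement measure from m per-distribution
    gradients, each flattened to a vector of length d.

    For large d the dense d x d covariance is impractical; tracking only its
    trace (the disagreement scalar below) is a cheaper surrogate.
    """
    G = torch.stack(per_dist_grads)              # shape (m, d)
    g_bar = G.mean(dim=0)                        # average gradient over distributions
    centered = G - g_bar                         # per-distribution deviations
    m = G.shape[0]
    sigma_g = centered.T @ centered / m          # empirical covariance, shape (d, d)
    disagreement = centered.pow(2).sum() / m     # equals tr(Sigma_g)
    return g_bar, sigma_g, disagreement
```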
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the structural insight provided by the excess-risk decomposition and the counterexample. We address the single major comment below.
Point-by-point responses
Referee: [excess-risk decomposition] The excess-risk decomposition (abstract and theoretical analysis) states that the two terms dominate under smoothness and bounded-shift conditions, yet provides no quantitative remainder bounds (e.g., third-derivative Lipschitz constants or explicit controls on ||H_i − H̄||). This assumption is load-bearing for the central claim that the alignment and curvature terms are the leading contributions and that neither can be omitted.
Authors: We agree that the absence of explicit quantitative remainder bounds leaves the 'leading-order' claim less precise than it could be. The current derivation relies on L-smoothness together with a bounded-shift assumption that keeps the per-distribution Hessians close to their average; under these conditions the higher-order terms are o(1) as the perturbation radius tends to zero, but the paper does not supply explicit constants (e.g., a third-derivative Lipschitz constant or a uniform bound on ||H_i − H̄||). In the revised manuscript we will add an appendix subsection that (i) states the additional assumption of thrice continuously differentiable losses with bounded third derivatives, (ii) derives the explicit O(δ³) remainder where δ is the maximum perturbation size, and (iii) makes the bounded-shift condition quantitative by requiring ||H_i − H̄|| ≤ ε for a small ε that can be absorbed into the leading terms. These additions will be placed after the main decomposition and will not alter the paper's central claims or experimental results. Revision: yes
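For illustration, the kind of quantitative statement promised here might take a form like the following, where $L_3$ is a uniform bound on third derivatives, $c$ an absolute constant, and $\delta$ the maximum perturbation size; none of these constants or exponents are taken from the paper:

```latex
\[
\left|\,\mathcal{E}(\theta)
  - c_1 \operatorname{tr}\!\left(\bar{H}^{-1}\Sigma_g\right)
  - c_2\,\delta^{2}\,\lambda_{\max}\!\left(\bar{H}\right)\right|
\;\le\; \frac{L_3}{6}\,\delta^{3} \;+\; c\,\varepsilon\,\delta^{2},
\qquad
\text{assuming } \|H_i - \bar{H}\| \le \varepsilon \ \text{for all } i.
\]
```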
Circularity Check
No significant circularity: excess-risk decomposition derived from assumptions
Full rationale
The paper presents a derivation of an excess-risk decomposition under stated smoothness and distribution-shift conditions, producing two additive leading-order terms controlled by the average Hessian and gradient covariance. These quantities arise as outputs of the expansion rather than being presupposed by definition or obtained via fitting to the target result. No load-bearing self-citations, self-definitional steps, or renamings of known results are indicated in the abstract or reader's summary. The central claim is therefore self-contained and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Excess risk admits an additive leading-order decomposition into an alignment term (the trace of the inverse average Hessian times the gradient covariance) and a curvature term (controlled by the average Hessian).