Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations

Jingxian Wang; Kin Whye Chew

arxiv: 2605.20732 · v1 · pith:SVM6LO33new · submitted 2026-05-20 · 💻 cs.CV

Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations

Kin Whye Chew , Jingxian Wang This is my paper

Pith reviewed 2026-05-21 05:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords spurious correlationsfeature disentanglementattention mechanismsglobal average poolingpost-hoc methodsCNN generalizationDeep Feature Reweighting

0 comments

The pith

Replacing global average pooling with attention-based reweighting allows post-hoc retraining to suppress spurious features before they mix with core ones in CNNs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CNNs trained on datasets with spurious correlations often rely on superficial cues because global average pooling mixes core and irrelevant spatial signals into a single vector. Standard post-hoc fixes like retraining only the classifier head cannot fully separate these signals once they are entangled. Deep Attention Reweighting inserts a trainable attention module that reweights spatial locations across feature maps, suppressing spurious regions before the collapse occurs. When this module is retrained together with the classification head, the resulting model shows higher accuracy on core-feature tests than previous methods. The approach demonstrates that the choice of aggregation layer controls how much spurious information survives into the final representation.

Core claim

The Global Average Pooling layer indiscriminately collapses spatially distinct core and spurious features into one representation, limiting the effectiveness of retraining only the classifier head. Deep Attention Reweighting replaces this pooling with an adaptive weighting of spatial locations across feature maps, enabling selective suppression of spurious features before entanglement. When the new module is retrained jointly with the classification head on a target dataset, it consistently outperforms Deep Feature Reweighting across datasets, metrics, and ablations.

What carries the argument

Deep Attention Reweighting (DAR), a post-hoc attention-based aggregation module that replaces Global Average Pooling and computes adaptive weights for spatial locations in feature maps to suppress spurious signals.

If this is right

Selective spatial suppression before pooling reduces a model's reliance on spurious correlations more effectively than operating on already-entangled features.
The performance advantage of DAR over DFR holds across multiple datasets, evaluation metrics, and ablation settings.
Joint retraining of the aggregation module and head is sufficient to realize the gains without updating the convolutional backbone.
Attention-based aggregation mitigates the specific limitation introduced by fixed global average pooling under spurious correlations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar attention reweighting could be inserted at other aggregation points inside CNNs or in non-CNN vision architectures to limit spurious feature propagation.
Preventing entanglement at the pooling stage might lower the cost of later interventions and encourage training pipelines that preserve spatial distinctions from the start.
Applying the same module during initial training rather than only post-hoc could reveal whether early intervention prevents spurious correlations from forming at all.

Load-bearing premise

The entanglement of core and spurious features is fundamentally caused by the Global Average Pooling layer indiscriminately collapsing spatially distinct features.

What would settle it

Measuring attention weights produced by DAR on held-out examples from a dataset with spatially localized spurious cues; if the weights do not systematically down-weight the spurious spatial regions while accuracy on core-only tests improves, the proposed mechanism is not operating as claimed.

Figures

Figures reproduced from arXiv: 2605.20732 by Jingxian Wang, Kin Whye Chew.

**Figure 1.** Figure 1: Illustration of GAP vs. DAR. The input image from the Dominoes dataset consists of the spurious MNIST image concatenated with the core CIFAR image. After feature extraction by the convolutional layers, we find that the output feature maps are entangled, with each feature map activating both core and spurious features at distinct spatial locations. GAP uniformly averages these feature maps across spatial lo… view at source ↗

**Figure 2.** Figure 2: Histogram of CEP values across 512 output feature maps for various methods. Figures (a), (b), and (c) analyze the feature maps as a whole, whereas Figure (d) analyzes the feature maps at the pixel level. Refer to Section 3.2 for a detailed analysis. (a) ERM_{Core} (b) ERM (c) DFR_{FC} (d) DAR (e) DAR_{Spu} [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Histogram of CEP values across 512 output features for various baseline methods. Refer to Section 3.2, 4.5, and 5.1 for a detailed analysis. \label {eqn:csp} \text {CEP} = \mathbb {E}_{\mathbf {x}}\!\left [ \frac {E_{\text {core}}(\mathbf {x})}{E_{\text {core}}(\mathbf {x}) + E_{\text {spu}}(\mathbf {x})} \right ] \times 100\% . (3) High CEP (≈ 100%) indicates reliance on core features; low CEP (≈ 0%) ind… view at source ↗

**Figure 4.** Figure 4: Post-hoc retraining architecture ablations. [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Ablations for method characterization. (a) Feature learning compatibility. (b) Complete spatial overlap (CMNIST) robustness. (c) CNN architecture generality. Attention Architecture Ablation. Figure 4a validates the attention-module design in Section 4.3 by ablating one component at a time from the proposed architecture. The proposed design performs best, and every simplification degrades performance. Thes… view at source ↗

**Figure 6.** Figure 6: Histogram of CAP values across 512 output feature maps. bottom half of the feature map corresponds to the CIFAR input. We compute the Core Activation Percentage (CAP) for the j-th feature map as follows: \label {eqn:cap} CAP_j = \mathbb {E}_{i}\left [ \frac {\sum _{h=H/2}^{H}\sum _{w}^{W}|\mathbf {A}_i[j, h, w]|}{\sum _{h=1}^{H}\sum _{w=1}^{W}|\mathbf {A}_i[j, h, w]|}\right ] * 100\% (6) where H and W are … view at source ↗

**Figure 7.** Figure 7: A random sample of 16 GradCAM images from the test datasets for ERM, DF R, DAR, DARSpu models that were obtained from the main experiments for the Dominoes dataset. 3. DFR: While DFR improves the CGP score (CGP = 70.0%, [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

read the original abstract

Convolutional Neural Networks (CNNs) often exploit spurious correlations in datasets, learning superficially predictive yet causally irrelevant features, leading to poor generalization and fairness issues. Deep Feature Reweighting (DFR) is a post-hoc technique that reduces a trained model's reliance on spurious correlations by retraining its classification head on a target dataset. However, we show that DFR is fundamentally constrained by operating on entangled features, limiting its ability to amplify the core features while simultaneously suppressing the spurious ones. We trace this entanglement to the ubiquitous Global Average Pooling (GAP) layer, which indiscriminately collapses spatially distinct core and spurious features into a single representation. To address this, we propose Deep Attention Reweighting (DAR), a post-hoc attention-based aggregation module that replaces GAP and is retrained jointly with the classification head. DAR computes an adaptive weighting of spatial locations across feature maps, enabling selective suppression of spurious features before the collapse into entangled features. Across various datasets, metrics, and ablations, DAR consistently outperforms DFR, demonstrating that our attention-based aggregation mitigates GAP-induced entanglement and reduces spurious reliance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DAR swaps GAP for attention in the DFR pipeline and reports gains, but the selective suppression story rests on an unverified assumption about what the attention actually learns.

read the letter

The core move here is straightforward: take the DFR setup, freeze the backbone, and replace global average pooling with a learned attention module that reweights spatial locations before the features go into the retrained head. The authors argue this lets the method suppress spurious signals that would otherwise get mixed in by indiscriminate averaging. That is a reasonable incremental idea if the bottleneck really is the pooling step rather than something earlier in the network.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Deep Attention Reweighting (DAR), a post-hoc module that replaces Global Average Pooling (GAP) in a frozen CNN backbone. DAR is retrained jointly with the classification head on a target dataset to adaptively weight spatial locations in feature maps, with the goal of selectively suppressing spurious features before they collapse into an entangled representation. The central claim is that this addresses a fundamental limitation of Deep Feature Reweighting (DFR), which operates on already-entangled features, and that DAR yields consistent improvements over DFR across datasets, metrics, and ablations.

Significance. If the mechanistic claim holds, the work offers a lightweight, architecture-compatible improvement to post-hoc debiasing methods for CNNs, with potential benefits for OOD generalization and fairness. The empirical scope (multiple datasets, ablations, and direct comparison to DFR) is a strength; however, the absence of direct evidence that attention maps perform the claimed selective suppression limits the interpretability of the gains.

major comments (2)

[Abstract, §3] Abstract and §3 (DAR formulation): the claim that DAR 'enables selective suppression of spurious features before the collapse' is load-bearing for the paper's contribution over DFR, yet the experiments provide no inspection of attention maps, no correlation with core/spurious region masks, and no control experiment isolating whether gains arise from selective suppression versus generic spatial reweighting or added capacity.
[§4] §4 (experimental results): while consistent outperformance versus DFR is reported, the absence of attention-map analysis or quantitative differential weighting metrics means the central explanation (mitigation of GAP-induced entanglement via selective suppression) remains unverified; this must be addressed before the mechanistic interpretation can be accepted.

minor comments (2)

[§3] Notation for the attention weight computation (likely Eq. (X) in §3) should explicitly state whether the attention module shares parameters with the backbone or is trained from scratch, and whether any regularization is applied to encourage sparsity or selectivity.
[Figures in §4] Figure captions and axis labels in the ablation plots could be expanded to clarify which metrics correspond to core-feature accuracy versus spurious-feature suppression.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which help clarify the need for stronger mechanistic evidence. We address each major point below and have incorporated revisions to include attention map analyses, quantitative metrics, and control experiments.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (DAR formulation): the claim that DAR 'enables selective suppression of spurious features before the collapse' is load-bearing for the paper's contribution over DFR, yet the experiments provide no inspection of attention maps, no correlation with core/spurious region masks, and no control experiment isolating whether gains arise from selective suppression versus generic spatial reweighting or added capacity.

Authors: We agree that direct inspection of the attention mechanism is necessary to substantiate the selective suppression claim. In the revised manuscript, we have added visualizations of the learned attention maps on datasets with available core/spurious region annotations (e.g., Waterbirds and CelebA), along with quantitative correlations between attention weights and ground-truth masks. We also include a new control experiment comparing DAR against a non-adaptive spatial reweighting baseline (fixed uniform weights plus added capacity) and a random attention variant. These results show that performance gains are attributable to adaptive, selective weighting rather than generic reweighting or capacity alone, and we have updated the abstract and §3 to reference these findings. revision: yes
Referee: [§4] §4 (experimental results): while consistent outperformance versus DFR is reported, the absence of attention-map analysis or quantitative differential weighting metrics means the central explanation (mitigation of GAP-induced entanglement via selective suppression) remains unverified; this must be addressed before the mechanistic interpretation can be accepted.

Authors: We acknowledge that the original experiments lacked direct verification of the proposed mechanism. The revised §4 now incorporates attention-map analysis across all evaluated datasets and introduces quantitative differential weighting metrics, specifically the mean attention ratio on core versus spurious regions (computed using available annotations or proxy masks derived from dataset structure). These metrics demonstrate statistically higher weighting on core features under DAR compared to GAP, supporting the mitigation of entanglement. New figures and tables present these results alongside the existing performance comparisons, and we have added a brief discussion of how this evidence strengthens the interpretation over DFR. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method proposal with no derivation chain reducing to fitted inputs or self-citations by construction.

full rationale

The paper proposes DAR as a post-hoc attention module replacing GAP, retrained with the classification head, and evaluates it empirically against DFR on datasets. The abstract and provided text contain no equations, no fitted parameters renamed as predictions, no self-citations invoked as uniqueness theorems, and no ansatz smuggled via prior work. The central claim (attention enables selective suppression before collapse) is supported by experimental comparisons rather than any self-referential reduction. This matches the default case of a self-contained empirical contribution with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based solely on the abstract, the central claim rests on the assumption that GAP is the primary source of feature entanglement and that a trainable attention module can selectively suppress spurious spatial locations. No explicit free parameters beyond standard training are detailed. The attention module is the main invented component.

axioms (1)

domain assumption Global Average Pooling indiscriminately collapses spatially distinct core and spurious features into entangled representations
Directly stated in the abstract as the root cause limiting DFR.

invented entities (1)

Deep Attention Reweighting (DAR) module no independent evidence
purpose: Adaptive weighting of spatial locations in feature maps to suppress spurious features before pooling
New post-hoc attention-based aggregation introduced to replace GAP.

pith-pipeline@v0.9.0 · 5737 in / 1354 out tokens · 45793 ms · 2026-05-21T05:01:23.781146+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 11 internal anchors

[1]

In: III, H.D., Singh, A

Ahuja, K., Shanmugam, K., Varshney, K., Dhurandhar, A.: Invariant risk min- imization games. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th Inter- national Conference on Machine Learning. Proceedings of Machine Learning Re- search, vol. 119, pp. 145–155. PMLR (13–18 Jul 2020),https://proceedings. mlr.press/v119/ahuja20a.html

work page 2020
[2]

Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization (2020)

work page 2020
[3]

Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented con- volutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019),https://openaccess.thecvf.com/content_ ICCV_2019/html/Bello_Attention_Augmented_Convolutional_Networks_ICCV_ 2019_paper.html

work page 2019
[4]

doi: 10.1109/TPAMI.2013.50

Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (2013).https://doi.org/10.1109/TPAMI.2013.50

work page doi:10.1109/tpami.2013.50 2013
[5]

Burgess, C.P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., Ler- chner, A.: Understanding disentangling inβ-vae (2018),https://arxiv.org/abs/ 1804.03599

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

IEEE Transactions on Neural Networks and Learning Systems 35(7), 8747–8761 (2024).https://doi.org/10.1109/TNNLS.2022.3218982

Carbonneau, M.A., Zaïdi, J., Boilard, J., Gagnon, G.: Measuring disentanglement: A review of metrics. IEEE Transactions on Neural Networks and Learning Systems 35(7), 8747–8761 (2024).https://doi.org/10.1109/TNNLS.2022.3218982

work page doi:10.1109/tnnls.2022.3218982 2024
[7]

Chen, A.S., Lee, Y., Setlur, A., Levine, S., Finn, C.: Confidence-based model se- lection: When to take shortcuts for subpopulation shifts (2023)

work page 2023
[8]

Chen, R.T.Q., Li, X., Grosse, R., Duvenaud, D.: Isolating sources of disentangle- ment in variational autoencoders (2019),https://arxiv.org/abs/1802.04942

work page internal anchor Pith review Pith/arXiv arXiv 2019
[9]

IEEE Signal Processing Magazine29(6), 141–142 (2012)

Deng, L.: The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine29(6), 141–142 (2012)

work page 2012
[10]

IEEE Transactions on Multimedia24, 2407–2421 (2022).https://doi.org/10.1109/ TMM.2021.3080516

Deng, W., Zhao, L., Liao, Q., Guo, D., Kuang, G., Hu, D., Pietikäinen, M., Liu, L.: Informative feature disentanglement for unsupervised domain adaptation. IEEE Transactions on Multimedia24, 2407–2421 (2022).https://doi.org/10.1109/ TMM.2021.3080516

work page arXiv 2022
[11]

In: International Conference on Learning Representations (ICLR) (2021),https: //openreview.net/forum?id=YicbFdNTTy

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021),https: //openreview.net/forum?id=YicbFdNTTy

work page 2021
[12]

Dupont, E.: Learning disentangled joint continuous and discrete representations (2018),https://arxiv.org/abs/1804.00104

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

Shortcut Learning in Deep Neural Networks , journal =

Geirhos, R., Jacobsen, J., Michaelis, C., Zemel, R.S., Brendel, W., Bethge, M., Wichmann, F.A.: Shortcut learning in deep neural networks. CoRR abs/2004.07780(2020),https://arxiv.org/abs/2004.07780

work page arXiv 2004
[14]

Ghosal, S.S., Ming, Y., Li, Y.: Are vision transformers robust to spurious correla- tions? (2022),https://arxiv.org/abs/2203.09125

work page arXiv 2022
[15]

IEEE Transactions on Pattern Analysis and Machine Intelligence45(1), 87–110 (2023)

Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., Yang, Z., Zhang, Y., Tao, D.: A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence45(1), 87–110 (2023). https://doi.org/10.1109/TPAMI.2022.3152247 16 Chew and Wang

work page doi:10.1109/tpami.2022.3152247 2023
[16]

He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016),https://openaccess.thecvf.com/content_cvpr_ 2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html

work page 2016
[17]

Heinze-Deml, C., Peters, J., Meinshausen, N.: Invariant causal prediction for non- linear models (2018)

work page 2018
[18]

In: International Conference on Learning Repre- sentations (ICLR) (2017)

Higgins, I., Matthey, L., Pal, A., Burgess, C.P., Glorot, X., Botvinick, M., Mo- hamed, S., Lerchner, A.: Beta-vae: Learning basic visual concepts with a con- strained variational framework. In: International Conference on Learning Repre- sentations (ICLR) (2017)

work page 2017
[19]

Higgins, I., Sonnerat, N., Matthey, L., Pal, A., Burgess, C.P., Bosnjak, M., Shana- han, M., Botvinick, M., Hassabis, D., Lerchner, A.: Scan: Learning hierarchical compositional visual concepts (2018),https://arxiv.org/abs/1707.03389

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze- and- Excitation_Networks_CVPR_2018_paper.html

work page 2018
[21]

Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vi- sion and Pattern Recognition (CVPR) (2017),https://openaccess.thecvf.com/ content _ cvpr _ 2017 / html / Huang _ Densely _ Connected _ Convolutional _ CVPR _ 2017_paper.html

work page 2017
[22]

Huang, Z., Wang, H., Xing, E.P., Huang, D.: Self-challenging improves cross- domain generalization (2020),https://arxiv.org/abs/2007.02454

work page arXiv 2020
[23]

In: International Conference on Ar- tificial Intelligence and Statistics (AISTATS) (2022),https://proceedings.mlr

Idrissi, B., Arjovsky, M., Pezeshki, M., Lopez-Paz, D.: Simple data balancing achieves competitive worst-group-accuracy. In: International Conference on Ar- tificial Intelligence and Statistics (AISTATS) (2022),https://proceedings.mlr. press/v177/idrissi22a.html

work page 2022
[24]

In: In- ternational Conference on Learning Representations (ICLR) (2018),https:// openreview.net/forum?id=HkG3SJZ1D

Jetley, S., Lord, N.A., Lee, N., Torr, P.H.S.: Learn to pay attention. In: In- ternational Conference on Learning Representations (ICLR) (2018),https:// openreview.net/forum?id=HkG3SJZ1D

work page 2018
[25]

Joshi, S., Yang, Y., Xue, Y., Yang, W., Mirzasoleiman, B.: Towards mitigating spurious correlations in the wild: A benchmark and a more realistic dataset (2023)

work page 2023
[26]

Kim, H., Mnih, A.: Disentangling by factorising (2019),https://arxiv.org/abs/ 1802.05983

work page internal anchor Pith review Pith/arXiv arXiv 2019
[27]

Kim, M., Wang, Y., Sahu, P., Pavlovic, V.: Relevance factor vae: Learning and identifying disentangled factors (2019),https://arxiv.org/abs/1902.01568

work page internal anchor Pith review Pith/arXiv arXiv 2019
[28]

In: International Conference on Learning Representations (ICLR)

Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR). San Diego, CA, USA (2015)

work page 2015
[29]

Kingma,D.P.,Welling,M.:Auto-encodingvariationalbayes.In:InternationalCon- ference on Learning Representations (ICLR) (2014),https://arxiv.org/abs/ 1312.6114

work page internal anchor Pith review Pith/arXiv arXiv 2014
[30]

In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=Zb6c8A- Fghk

Kirichenko, P., Izmailov, P., Wilson, A.G.: Last layer re-training is sufficient for robustness to spurious correlations. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=Zb6c8A- Fghk

work page 2023
[31]

In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S

Kong, L., Xie, S., Yao, W., Zheng, Y., Chen, G., Stojanov, P., Akinwande, V., Zhang, K.: Partial disentanglement for domain adaptation. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Proceedings of the 39th International Conference on Machine Learning. Proceedings of Machine Deep Attention Reweighting 17 Learning Rese...

work page 2022
[32]

Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. Rep. 0, University of Toronto, Toronto, Ontario (2009),https://www.cs. toronto.edu/~kriz/learning-features-2009-TR.pdf

work page 2009
[33]

Kumar, A., Sattigeri, P., Balakrishnan, A.: Variational inference of disentangled la- tent concepts from unlabeled observations (2018),https://arxiv.org/abs/1711. 00848

work page 2018
[34]

In: Oh, A., Neumann, T., Glober- son, A., Saenko, K., Hardt, M., Levine, S

LaBonte, T., Muthukumar, V., Kumar, A.: Towards last-layer retraining for group robustness with fewer annotations. In: Oh, A., Neumann, T., Glober- son, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Infor- mation Processing Systems. vol. 36, pp. 11552–11579. Curran Associates, Inc. (2023),https : / / proceedings . neurips . cc / paper _ ...

work page 2023
[35]

Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learn and think like people (2016)

work page 2016
[36]

Lee, S., Cho, S., Im, S.: Dranet: Disentangling representation and adaptation net- worksforunsupervisedcross-domainadaptation.In:ProceedingsoftheIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15252– 15261 (June 2021)

work page 2021
[37]

Lee, Y., Yao, H., Finn, C.: Diversify and disambiguate: Learning from underspec- ified data (2023)

work page 2023
[38]

Levy, D., Carmon, Y., Duchi, J.C., Sidford, A.: Large-scale methods for distribu- tionally robust optimization (2020),https://arxiv.org/abs/2010.05893

work page arXiv 2020
[39]

Li, Z., Evtimov, I., Gordo, A., Hazirbas, C., Hassner, T., Ferrer, C.C., Xu, C., Ibrahim, M.: A whac-a-mole dilemma: Shortcuts come in multiples where mitigat- ing one amplifies others (2023),https://arxiv.org/abs/2212.04825

work page arXiv 2023
[40]

In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Liang, W., Mao, Y., Kwon, Y., Yang, X., Zou, J.: Accuracy on the curve: On the nonlinear correlation of ML performance between data subpopulations. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceed- ings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 20...

work page 2023
[41]

In: International Conference on Learning Representations (ICLR) (2014),https://openreview.net/forum?id= ylE6yojDR5yqX

Lin, M., Chen, Q., Yan, S.: Network in network. In: International Conference on Learning Representations (ICLR) (2014),https://openreview.net/forum?id= ylE6yojDR5yqX

work page 2014
[42]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Lin, Y., Dong, H., Wang, H., Zhang, T.: Bayesian invariant risk minimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16021–16030 (June 2022)

work page 2022
[43]

In: International Conference on Machine Learning (ICML) (2021)

Liu, S., Beery, S., Teney, D., Liu, S., van den Hengel, A., Gould, S.: Just train twice: Improving group robustness without training group information. In: International Conference on Machine Learning (ICML) (2021)

work page 2021
[44]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021),https : / / openaccess

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021),https : / / openaccess . thecvf . com / content / ICCV2021 / html / Liu _ Swin _ Transformer _ Hierarchical _ Vision _ Tr...

work page 2021
[45]

Deep Learning Face Attributes in the Wild

Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015),https://arxiv.org/abs/1411.7766 18 Chew and Wang

work page internal anchor Pith review Pith/arXiv arXiv 2015
[46]

Locatello, F., Tschannen, M., Bauer, S., Rätsch, G., Schölkopf, B., Bachem, O.: Disentangling factors of variation using few labels (2020),https://arxiv.org/ abs/1905.01258

work page arXiv 2020
[47]

Lopez-Paz, D.: From dependence to causation (2016)

work page 2016
[48]

Lynch, A., Dovonon, G.J.S., Kaddour, J., Silva, R.: Spawrious: A benchmark for fine control of spurious correlation biases (2023)

work page 2023
[49]

Marcus, G.: Deep learning: A critical appraisal (2018)

work page 2018
[50]

Mathieu, E., Rainforth, T., Siddharth, N., Teh, Y.W.: Disentangling disentangle- ment in variational autoencoders (2019),https://arxiv.org/abs/1812.02833

work page internal anchor Pith review Pith/arXiv arXiv 2019
[51]

Tesseract: A search-based decoder for quantum error correction.arXiv preprint arXiv:2503.10988, 2025

Nagarajan, V., Andreassen, A., Neyshabur, B.: Understanding the failure modes of out-of-distribution generalization (2020).https://doi.org/10.48550/ARXIV. 2010.15775,https://arxiv.org/abs/2010.15775

work page internal anchor Pith review doi:10.48550/arxiv 2020
[52]

Nam,J.,Cha,H.,Ahn,S.,Lee,J.,Shin,J.:Learningfromfailure:Trainingdebiased classifier from biased classifier (2020)

work page 2020
[53]

Pagliardini, M., Jaggi, M., Fleuret, F., Karimireddy, S.P.: Agree to disagree: Di- versity through disagreement for better transferability (2022)

work page 2022
[54]

Pearl, J.: The do-calculus revisited (2012),https://arxiv.org/abs/1210.4852

work page internal anchor Pith review Pith/arXiv arXiv 2012
[55]

Peters, J., Bühlmann, P., Meinshausen, N.: Causal inference using invariant pre- diction: identification and confidence intervals (2015)

work page 2015
[56]

Gradient Starvation:

Pezeshki, M., Kaba, S., Bengio, Y., Courville, A.C., Precup, D., Lajoie, G.: Gradi- ent starvation: A learning proclivity in neural networks. CoRRabs/2011.09468 (2020),https://arxiv.org/abs/2011.09468

work page arXiv 2011
[57]

In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Qiu, S., Potapczynski, A., Izmailov, P., Wilson, A.G.: Simple and fast group ro- bustness by automatic feature reweighting. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th Inter- national Conference on Machine Learning. Proceedings of Machine Learning Re- search, vol. 202, pp. 28448–28467. PM...

work page 2023
[58]

In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=ryxGuJrFvS

Sagawa*, S., Koh*, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neu- ral networks. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=ryxGuJrFvS

work page 2020
[59]

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR) (2018),https: //openaccess.thecvf.com/content_cvpr_2018/html/Sandler_MobileNetV2_ Inverted_Residuals_CVPR_2018_paper.html

work page 2018
[60]

doi:10.1007/s11263-019-01228-7

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad- CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision128(2), 336–359 (oct 2019).https: //doi.org/10.1007/s11263-019-01228-7,https://doi.org/10.1007/s11263- 019-01228-7

work page doi:10.1007/s11263-019-01228-7 2019
[61]

In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H

Shah, H., Tamuly, K., Raghunathan, A., Jain, P., Netrapalli, P.: The pitfalls of simplicity bias in neural networks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9573–9585. Curran Associates, Inc. (2020),https://proceedings. neurips.cc/paper/2020/file/6cfe0e6127fa2...

work page 2020
[62]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) Deep Attention Reweighting 19

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) Deep Attention Reweighting 19

work page 2015
[63]

Taghanaki, S.A., Khani, A., Khani, F., Gholami, A., Tran, L., Mahdavi-Amiri, A., Hamarneh, G.: Masktune: Mitigating spurious correlations by forcing to explore (2022)

work page 2022
[64]

In: International Conference on Machine Learning (ICML) (2019), https://proceedings.mlr.press/v97/tan19a.html

Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neu- ral networks. In: International Conference on Machine Learning (ICML) (2019), https://proceedings.mlr.press/v97/tan19a.html

work page 2019
[65]

In: Interna- tional Conference on Machine Learning (ICML) (2021),https://proceedings

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., J’egou, H.: Train- ing data-efficient image transformers & distillation through attention. In: Interna- tional Conference on Machine Learning (ICML) (2021),https://proceedings. mlr.press/v139/touvron21a.html

work page 2021
[66]

Träuble, F., Creager, E., Kilbertus, N., Locatello, F., Dittadi, A., Goyal, A., Schölkopf, B., Bauer, S.: On disentangled representations learned from correlated data (2021),https://arxiv.org/abs/2006.07886

work page arXiv 2021
[67]

In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017),https://proceedings.neurips....

work page 2017
[68]

Wang, T., Zhou, C., Sun, Q., Zhang, H.: Causal attention for unbiased visual recognition (2021),https://arxiv.org/abs/2108.08782

work page arXiv 2021
[69]

Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions (2021),https://arxiv.org/abs/2102.12122

work page arXiv 2021
[70]

In: Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR)

Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR). pp. 7794–7803 (2018),https://openaccess.thecvf.com/content_ cvpr_2018/html/Wang_Non-Local_Neural_Networks_CVPR_2018_paper.html

work page 2018
[71]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 9677–9696 (2024).https://doi.org/10.1109/TPAMI.2024.3420937

Wang, X., Chen, H., Tang, S., Wu, Z., Zhu, W.: Disentangled representation learn- ing. IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 9677–9696 (2024).https://doi.org/10.1109/TPAMI.2024.3420937

work page doi:10.1109/tpami.2024.3420937 2024
[72]

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-ucsd birds 200. Tech. Rep. CNS-TR-2010-001, California Institute of Technology (2010)

work page 2010
[73]

J., 2022, in Bambi C., Santangelo A., eds, , Handbook of X-ray and Gamma-ray Astrophysics

Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: Convolutional block atten- tion module. In: Computer Vision – ECCV 2018. pp. 3–19 (2018).https:// doi.org/10.1007/978- 3- 030- 01234- 2_1,https://openaccess.thecvf.com/ content _ ECCV _ 2018 / html / Sanghyun _ Woo _ Convolutional _ Block _ Attention _ Module_ECCV_2018_paper.html

work page doi:10.1007/978- 2018
[74]

Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Yan, Z., Tomizuka, M., Gonzalez, J., Keutzer, K., Vajda, P.: Visual transformers: Token-based image representation and processing for computer vision (2020),https://arxiv.org/abs/2006.03677

work page arXiv 2020
[75]

Yang, X., Zhang, H., Qi, G., Cai, J.: Causal attention for vision-language tasks (2021),https://arxiv.org/abs/2103.03493

work page arXiv 2021
[76]

In: Proceedings of the 40th International Conference on Machine Learning

Yang, Y., Zhang, H., Katabi, D., Ghassemi, M.: Change is hard: a closer look at subpopulation shift. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, JMLR.org (2023)

work page 2023
[77]

Ye, W., Zheng, G., Cao, X., Ma, Y., Hu, X., Zhang, A.: Spurious correlations in machine learning: A survey (2024) 20 Chew and Wang

work page 2024
[78]

Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., Tay, F.E., Feng, J., Yan, S.: Tokens-to-token vit: Training vision transformers from scratch on imagenet (2021),https://arxiv.org/abs/2101.11986

work page arXiv 2021
[79]

In: 2024 7th International Conference on Artificial Intelli- gence and Big Data (ICAIBD)

Yue, D., Zou, J., Jin, X., Leng, T.: Causal inference for confounder-purify vi- sion transformers. In: 2024 7th International Conference on Artificial Intelli- gence and Big Data (ICAIBD). pp. 530–537 (2024).https://doi.org/10.1109/ ICAIBD62003.2024.10604648

work page arXiv 2024
[80]

In: International Conference on Machine Learning (ICML) (2022),https : / / proceedings

Zhang, M., Jia, R., Misra, D.: Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations. In: International Conference on Machine Learning (ICML) (2022),https : / / proceedings . mlr . press / v162 / zhang22c.html

work page 2022

Showing first 80 references.

[1] [1]

In: III, H.D., Singh, A

Ahuja, K., Shanmugam, K., Varshney, K., Dhurandhar, A.: Invariant risk min- imization games. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th Inter- national Conference on Machine Learning. Proceedings of Machine Learning Re- search, vol. 119, pp. 145–155. PMLR (13–18 Jul 2020),https://proceedings. mlr.press/v119/ahuja20a.html

work page 2020

[2] [2]

Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization (2020)

work page 2020

[3] [3]

Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented con- volutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019),https://openaccess.thecvf.com/content_ ICCV_2019/html/Bello_Attention_Augmented_Convolutional_Networks_ICCV_ 2019_paper.html

work page 2019

[4] [4]

doi: 10.1109/TPAMI.2013.50

Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (2013).https://doi.org/10.1109/TPAMI.2013.50

work page doi:10.1109/tpami.2013.50 2013

[5] [5]

Burgess, C.P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., Ler- chner, A.: Understanding disentangling inβ-vae (2018),https://arxiv.org/abs/ 1804.03599

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

IEEE Transactions on Neural Networks and Learning Systems 35(7), 8747–8761 (2024).https://doi.org/10.1109/TNNLS.2022.3218982

Carbonneau, M.A., Zaïdi, J., Boilard, J., Gagnon, G.: Measuring disentanglement: A review of metrics. IEEE Transactions on Neural Networks and Learning Systems 35(7), 8747–8761 (2024).https://doi.org/10.1109/TNNLS.2022.3218982

work page doi:10.1109/tnnls.2022.3218982 2024

[7] [7]

Chen, A.S., Lee, Y., Setlur, A., Levine, S., Finn, C.: Confidence-based model se- lection: When to take shortcuts for subpopulation shifts (2023)

work page 2023

[8] [8]

Chen, R.T.Q., Li, X., Grosse, R., Duvenaud, D.: Isolating sources of disentangle- ment in variational autoencoders (2019),https://arxiv.org/abs/1802.04942

work page internal anchor Pith review Pith/arXiv arXiv 2019

[9] [9]

IEEE Signal Processing Magazine29(6), 141–142 (2012)

Deng, L.: The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine29(6), 141–142 (2012)

work page 2012

[10] [10]

IEEE Transactions on Multimedia24, 2407–2421 (2022).https://doi.org/10.1109/ TMM.2021.3080516

Deng, W., Zhao, L., Liao, Q., Guo, D., Kuang, G., Hu, D., Pietikäinen, M., Liu, L.: Informative feature disentanglement for unsupervised domain adaptation. IEEE Transactions on Multimedia24, 2407–2421 (2022).https://doi.org/10.1109/ TMM.2021.3080516

work page arXiv 2022

[11] [11]

In: International Conference on Learning Representations (ICLR) (2021),https: //openreview.net/forum?id=YicbFdNTTy

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021),https: //openreview.net/forum?id=YicbFdNTTy

work page 2021

[12] [12]

Dupont, E.: Learning disentangled joint continuous and discrete representations (2018),https://arxiv.org/abs/1804.00104

work page internal anchor Pith review Pith/arXiv arXiv 2018

[13] [13]

Shortcut Learning in Deep Neural Networks , journal =

Geirhos, R., Jacobsen, J., Michaelis, C., Zemel, R.S., Brendel, W., Bethge, M., Wichmann, F.A.: Shortcut learning in deep neural networks. CoRR abs/2004.07780(2020),https://arxiv.org/abs/2004.07780

work page arXiv 2004

[14] [14]

Ghosal, S.S., Ming, Y., Li, Y.: Are vision transformers robust to spurious correla- tions? (2022),https://arxiv.org/abs/2203.09125

work page arXiv 2022

[15] [15]

IEEE Transactions on Pattern Analysis and Machine Intelligence45(1), 87–110 (2023)

Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., Yang, Z., Zhang, Y., Tao, D.: A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence45(1), 87–110 (2023). https://doi.org/10.1109/TPAMI.2022.3152247 16 Chew and Wang

work page doi:10.1109/tpami.2022.3152247 2023

[16] [16]

He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016),https://openaccess.thecvf.com/content_cvpr_ 2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html

work page 2016

[17] [17]

Heinze-Deml, C., Peters, J., Meinshausen, N.: Invariant causal prediction for non- linear models (2018)

work page 2018

[18] [18]

In: International Conference on Learning Repre- sentations (ICLR) (2017)

Higgins, I., Matthey, L., Pal, A., Burgess, C.P., Glorot, X., Botvinick, M., Mo- hamed, S., Lerchner, A.: Beta-vae: Learning basic visual concepts with a con- strained variational framework. In: International Conference on Learning Repre- sentations (ICLR) (2017)

work page 2017

[19] [19]

Higgins, I., Sonnerat, N., Matthey, L., Pal, A., Burgess, C.P., Bosnjak, M., Shana- han, M., Botvinick, M., Hassabis, D., Lerchner, A.: Scan: Learning hierarchical compositional visual concepts (2018),https://arxiv.org/abs/1707.03389

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [20]

Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze- and- Excitation_Networks_CVPR_2018_paper.html

work page 2018

[21] [21]

Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vi- sion and Pattern Recognition (CVPR) (2017),https://openaccess.thecvf.com/ content _ cvpr _ 2017 / html / Huang _ Densely _ Connected _ Convolutional _ CVPR _ 2017_paper.html

work page 2017

[22] [22]

Huang, Z., Wang, H., Xing, E.P., Huang, D.: Self-challenging improves cross- domain generalization (2020),https://arxiv.org/abs/2007.02454

work page arXiv 2020

[23] [23]

In: International Conference on Ar- tificial Intelligence and Statistics (AISTATS) (2022),https://proceedings.mlr

Idrissi, B., Arjovsky, M., Pezeshki, M., Lopez-Paz, D.: Simple data balancing achieves competitive worst-group-accuracy. In: International Conference on Ar- tificial Intelligence and Statistics (AISTATS) (2022),https://proceedings.mlr. press/v177/idrissi22a.html

work page 2022

[24] [24]

In: In- ternational Conference on Learning Representations (ICLR) (2018),https:// openreview.net/forum?id=HkG3SJZ1D

Jetley, S., Lord, N.A., Lee, N., Torr, P.H.S.: Learn to pay attention. In: In- ternational Conference on Learning Representations (ICLR) (2018),https:// openreview.net/forum?id=HkG3SJZ1D

work page 2018

[25] [25]

Joshi, S., Yang, Y., Xue, Y., Yang, W., Mirzasoleiman, B.: Towards mitigating spurious correlations in the wild: A benchmark and a more realistic dataset (2023)

work page 2023

[26] [26]

Kim, H., Mnih, A.: Disentangling by factorising (2019),https://arxiv.org/abs/ 1802.05983

work page internal anchor Pith review Pith/arXiv arXiv 2019

[27] [27]

Kim, M., Wang, Y., Sahu, P., Pavlovic, V.: Relevance factor vae: Learning and identifying disentangled factors (2019),https://arxiv.org/abs/1902.01568

work page internal anchor Pith review Pith/arXiv arXiv 2019

[28] [28]

In: International Conference on Learning Representations (ICLR)

Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR). San Diego, CA, USA (2015)

work page 2015

[29] [29]

Kingma,D.P.,Welling,M.:Auto-encodingvariationalbayes.In:InternationalCon- ference on Learning Representations (ICLR) (2014),https://arxiv.org/abs/ 1312.6114

work page internal anchor Pith review Pith/arXiv arXiv 2014

[30] [30]

In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=Zb6c8A- Fghk

Kirichenko, P., Izmailov, P., Wilson, A.G.: Last layer re-training is sufficient for robustness to spurious correlations. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=Zb6c8A- Fghk

work page 2023

[31] [31]

In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S

Kong, L., Xie, S., Yao, W., Zheng, Y., Chen, G., Stojanov, P., Akinwande, V., Zhang, K.: Partial disentanglement for domain adaptation. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Proceedings of the 39th International Conference on Machine Learning. Proceedings of Machine Deep Attention Reweighting 17 Learning Rese...

work page 2022

[32] [32]

Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. Rep. 0, University of Toronto, Toronto, Ontario (2009),https://www.cs. toronto.edu/~kriz/learning-features-2009-TR.pdf

work page 2009

[33] [33]

Kumar, A., Sattigeri, P., Balakrishnan, A.: Variational inference of disentangled la- tent concepts from unlabeled observations (2018),https://arxiv.org/abs/1711. 00848

work page 2018

[34] [34]

In: Oh, A., Neumann, T., Glober- son, A., Saenko, K., Hardt, M., Levine, S

LaBonte, T., Muthukumar, V., Kumar, A.: Towards last-layer retraining for group robustness with fewer annotations. In: Oh, A., Neumann, T., Glober- son, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Infor- mation Processing Systems. vol. 36, pp. 11552–11579. Curran Associates, Inc. (2023),https : / / proceedings . neurips . cc / paper _ ...

work page 2023

[35] [35]

Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learn and think like people (2016)

work page 2016

[36] [36]

Lee, S., Cho, S., Im, S.: Dranet: Disentangling representation and adaptation net- worksforunsupervisedcross-domainadaptation.In:ProceedingsoftheIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15252– 15261 (June 2021)

work page 2021

[37] [37]

Lee, Y., Yao, H., Finn, C.: Diversify and disambiguate: Learning from underspec- ified data (2023)

work page 2023

[38] [38]

Levy, D., Carmon, Y., Duchi, J.C., Sidford, A.: Large-scale methods for distribu- tionally robust optimization (2020),https://arxiv.org/abs/2010.05893

work page arXiv 2020

[39] [39]

Li, Z., Evtimov, I., Gordo, A., Hazirbas, C., Hassner, T., Ferrer, C.C., Xu, C., Ibrahim, M.: A whac-a-mole dilemma: Shortcuts come in multiples where mitigat- ing one amplifies others (2023),https://arxiv.org/abs/2212.04825

work page arXiv 2023

[40] [40]

In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Liang, W., Mao, Y., Kwon, Y., Yang, X., Zou, J.: Accuracy on the curve: On the nonlinear correlation of ML performance between data subpopulations. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceed- ings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 20...

work page 2023

[41] [41]

In: International Conference on Learning Representations (ICLR) (2014),https://openreview.net/forum?id= ylE6yojDR5yqX

Lin, M., Chen, Q., Yan, S.: Network in network. In: International Conference on Learning Representations (ICLR) (2014),https://openreview.net/forum?id= ylE6yojDR5yqX

work page 2014

[42] [42]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Lin, Y., Dong, H., Wang, H., Zhang, T.: Bayesian invariant risk minimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16021–16030 (June 2022)

work page 2022

[43] [43]

In: International Conference on Machine Learning (ICML) (2021)

Liu, S., Beery, S., Teney, D., Liu, S., van den Hengel, A., Gould, S.: Just train twice: Improving group robustness without training group information. In: International Conference on Machine Learning (ICML) (2021)

work page 2021

[44] [44]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021),https : / / openaccess

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021),https : / / openaccess . thecvf . com / content / ICCV2021 / html / Liu _ Swin _ Transformer _ Hierarchical _ Vision _ Tr...

work page 2021

[45] [45]

Deep Learning Face Attributes in the Wild

Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015),https://arxiv.org/abs/1411.7766 18 Chew and Wang

work page internal anchor Pith review Pith/arXiv arXiv 2015

[46] [46]

Locatello, F., Tschannen, M., Bauer, S., Rätsch, G., Schölkopf, B., Bachem, O.: Disentangling factors of variation using few labels (2020),https://arxiv.org/ abs/1905.01258

work page arXiv 2020

[47] [47]

Lopez-Paz, D.: From dependence to causation (2016)

work page 2016

[48] [48]

Lynch, A., Dovonon, G.J.S., Kaddour, J., Silva, R.: Spawrious: A benchmark for fine control of spurious correlation biases (2023)

work page 2023

[49] [49]

Marcus, G.: Deep learning: A critical appraisal (2018)

work page 2018

[50] [50]

Mathieu, E., Rainforth, T., Siddharth, N., Teh, Y.W.: Disentangling disentangle- ment in variational autoencoders (2019),https://arxiv.org/abs/1812.02833

work page internal anchor Pith review Pith/arXiv arXiv 2019

[51] [51]

Tesseract: A search-based decoder for quantum error correction.arXiv preprint arXiv:2503.10988, 2025

Nagarajan, V., Andreassen, A., Neyshabur, B.: Understanding the failure modes of out-of-distribution generalization (2020).https://doi.org/10.48550/ARXIV. 2010.15775,https://arxiv.org/abs/2010.15775

work page internal anchor Pith review doi:10.48550/arxiv 2020

[52] [52]

Nam,J.,Cha,H.,Ahn,S.,Lee,J.,Shin,J.:Learningfromfailure:Trainingdebiased classifier from biased classifier (2020)

work page 2020

[53] [53]

Pagliardini, M., Jaggi, M., Fleuret, F., Karimireddy, S.P.: Agree to disagree: Di- versity through disagreement for better transferability (2022)

work page 2022

[54] [54]

Pearl, J.: The do-calculus revisited (2012),https://arxiv.org/abs/1210.4852

work page internal anchor Pith review Pith/arXiv arXiv 2012

[55] [55]

Peters, J., Bühlmann, P., Meinshausen, N.: Causal inference using invariant pre- diction: identification and confidence intervals (2015)

work page 2015

[56] [56]

Gradient Starvation:

Pezeshki, M., Kaba, S., Bengio, Y., Courville, A.C., Precup, D., Lajoie, G.: Gradi- ent starvation: A learning proclivity in neural networks. CoRRabs/2011.09468 (2020),https://arxiv.org/abs/2011.09468

work page arXiv 2011

[57] [57]

In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Qiu, S., Potapczynski, A., Izmailov, P., Wilson, A.G.: Simple and fast group ro- bustness by automatic feature reweighting. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th Inter- national Conference on Machine Learning. Proceedings of Machine Learning Re- search, vol. 202, pp. 28448–28467. PM...

work page 2023

[58] [58]

In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=ryxGuJrFvS

Sagawa*, S., Koh*, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neu- ral networks. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=ryxGuJrFvS

work page 2020

[59] [59]

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR) (2018),https: //openaccess.thecvf.com/content_cvpr_2018/html/Sandler_MobileNetV2_ Inverted_Residuals_CVPR_2018_paper.html

work page 2018

[60] [60]

doi:10.1007/s11263-019-01228-7

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad- CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision128(2), 336–359 (oct 2019).https: //doi.org/10.1007/s11263-019-01228-7,https://doi.org/10.1007/s11263- 019-01228-7

work page doi:10.1007/s11263-019-01228-7 2019

[61] [61]

In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H

Shah, H., Tamuly, K., Raghunathan, A., Jain, P., Netrapalli, P.: The pitfalls of simplicity bias in neural networks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9573–9585. Curran Associates, Inc. (2020),https://proceedings. neurips.cc/paper/2020/file/6cfe0e6127fa2...

work page 2020

[62] [62]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) Deep Attention Reweighting 19

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) Deep Attention Reweighting 19

work page 2015

[63] [63]

Taghanaki, S.A., Khani, A., Khani, F., Gholami, A., Tran, L., Mahdavi-Amiri, A., Hamarneh, G.: Masktune: Mitigating spurious correlations by forcing to explore (2022)

work page 2022

[64] [64]

In: International Conference on Machine Learning (ICML) (2019), https://proceedings.mlr.press/v97/tan19a.html

Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neu- ral networks. In: International Conference on Machine Learning (ICML) (2019), https://proceedings.mlr.press/v97/tan19a.html

work page 2019

[65] [65]

In: Interna- tional Conference on Machine Learning (ICML) (2021),https://proceedings

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., J’egou, H.: Train- ing data-efficient image transformers & distillation through attention. In: Interna- tional Conference on Machine Learning (ICML) (2021),https://proceedings. mlr.press/v139/touvron21a.html

work page 2021

[66] [66]

Träuble, F., Creager, E., Kilbertus, N., Locatello, F., Dittadi, A., Goyal, A., Schölkopf, B., Bauer, S.: On disentangled representations learned from correlated data (2021),https://arxiv.org/abs/2006.07886

work page arXiv 2021

[67] [67]

In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017),https://proceedings.neurips....

work page 2017

[68] [68]

Wang, T., Zhou, C., Sun, Q., Zhang, H.: Causal attention for unbiased visual recognition (2021),https://arxiv.org/abs/2108.08782

work page arXiv 2021

[69] [69]

Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions (2021),https://arxiv.org/abs/2102.12122

work page arXiv 2021

[70] [70]

In: Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR)

Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR). pp. 7794–7803 (2018),https://openaccess.thecvf.com/content_ cvpr_2018/html/Wang_Non-Local_Neural_Networks_CVPR_2018_paper.html

work page 2018

[71] [71]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 9677–9696 (2024).https://doi.org/10.1109/TPAMI.2024.3420937

Wang, X., Chen, H., Tang, S., Wu, Z., Zhu, W.: Disentangled representation learn- ing. IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 9677–9696 (2024).https://doi.org/10.1109/TPAMI.2024.3420937

work page doi:10.1109/tpami.2024.3420937 2024

[72] [72]

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-ucsd birds 200. Tech. Rep. CNS-TR-2010-001, California Institute of Technology (2010)

work page 2010

[73] [73]

J., 2022, in Bambi C., Santangelo A., eds, , Handbook of X-ray and Gamma-ray Astrophysics

Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: Convolutional block atten- tion module. In: Computer Vision – ECCV 2018. pp. 3–19 (2018).https:// doi.org/10.1007/978- 3- 030- 01234- 2_1,https://openaccess.thecvf.com/ content _ ECCV _ 2018 / html / Sanghyun _ Woo _ Convolutional _ Block _ Attention _ Module_ECCV_2018_paper.html

work page doi:10.1007/978- 2018

[74] [74]

Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Yan, Z., Tomizuka, M., Gonzalez, J., Keutzer, K., Vajda, P.: Visual transformers: Token-based image representation and processing for computer vision (2020),https://arxiv.org/abs/2006.03677

work page arXiv 2020

[75] [75]

Yang, X., Zhang, H., Qi, G., Cai, J.: Causal attention for vision-language tasks (2021),https://arxiv.org/abs/2103.03493

work page arXiv 2021

[76] [76]

In: Proceedings of the 40th International Conference on Machine Learning

Yang, Y., Zhang, H., Katabi, D., Ghassemi, M.: Change is hard: a closer look at subpopulation shift. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, JMLR.org (2023)

work page 2023

[77] [77]

Ye, W., Zheng, G., Cao, X., Ma, Y., Hu, X., Zhang, A.: Spurious correlations in machine learning: A survey (2024) 20 Chew and Wang

work page 2024

[78] [78]

Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., Tay, F.E., Feng, J., Yan, S.: Tokens-to-token vit: Training vision transformers from scratch on imagenet (2021),https://arxiv.org/abs/2101.11986

work page arXiv 2021

[79] [79]

In: 2024 7th International Conference on Artificial Intelli- gence and Big Data (ICAIBD)

Yue, D., Zou, J., Jin, X., Leng, T.: Causal inference for confounder-purify vi- sion transformers. In: 2024 7th International Conference on Artificial Intelli- gence and Big Data (ICAIBD). pp. 530–537 (2024).https://doi.org/10.1109/ ICAIBD62003.2024.10604648

work page arXiv 2024

[80] [80]

In: International Conference on Machine Learning (ICML) (2022),https : / / proceedings

Zhang, M., Jia, R., Misra, D.: Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations. In: International Conference on Machine Learning (ICML) (2022),https : / / proceedings . mlr . press / v162 / zhang22c.html

work page 2022