Recognition: 2 theorem links
Attribution-Guided Masking for Robust Cross-Domain Sentiment Classification
Pith reviewed 2026-05-08 18:25 UTC · model grok-4.3
The pith
Attribution-Guided Masking reduces reliance on domain-specific tokens to improve zero-shot sentiment classification across domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attribution-Guided Masking (AGM) dynamically detects and penalizes highly attributed spurious tokens during fine-tuning with a gradient-based attribution masking loss L_mask. When optionally paired with a counterfactual contrastive loss, AGM improves zero-shot cross-domain sentiment performance across four domains. On the hardest transfer, to Sentiment140, it reaches a generalization-gap delta of 0.244 (lower is better), competitive with DANN, DRO, Fish, and IRM, while uniquely exposing, through attributions, which tokens drive the generalization gap. Ablations confirm that the attribution-guided component, not random masking, is what delivers the gains.
What carries the argument
The gradient-based attribution masking loss L_mask, which identifies tokens with high attribution scores during fine-tuning and applies penalties to reduce their contribution to predictions.
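The page elsewhere quotes this loss as $\mathcal{L}_{mask} = (1/|M|)\,\sum (a_i)^2$ with threshold $\tau_{high} = 0.75$. A minimal sketch of how such a loss could be computed, using a toy linear scorer for which the gradient×input attribution has a closed form; the function name, the linear model, and the nearest-rank quantile are illustrative assumptions, not the authors' implementation.

```python
def attribution_masking_loss(embeddings, weights, tau_high=0.75):
    # For a linear scorer logit = sum_i dot(w, e_i), the gradient of the
    # logit w.r.t. token embedding e_i is w, so the gradient-x-input
    # attribution of token i is a_i = dot(w, e_i).
    attributions = [sum(w * x for w, x in zip(weights, e)) for e in embeddings]
    # Tokens whose attribution reaches the tau_high quantile form the
    # penalized set M (simple nearest-rank quantile, an assumption here).
    ranked = sorted(attributions)
    threshold = ranked[min(len(ranked) - 1, int(tau_high * len(ranked)))]
    masked = [a for a in attributions if a >= threshold]
    # L_mask = (1/|M|) * sum over M of a_i^2, as quoted from the paper.
    return sum(a * a for a in masked) / len(masked)
```

With four tokens whose attributions are 1, 1, 2, and 3 and $\tau_{high} = 0.75$, only the top-attribution token is penalized and the loss is 9.0.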
If this is right
- AGM matches or approaches five strong baselines on the hardest zero-shot transfer while adding token-level explanations for failures.
- Qualitative results show suppressed attributions on spurious tokens and increased use of domain-invariant sentiment markers.
- Removing the attribution component or swapping it for random selection consistently hurts performance on difficult transfers.
- The full method requires no target labels or human annotation yet still narrows the generalization gap.
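The third bullet's random-selection control can be sketched as a single selection function that either picks the top-attribution tokens or a size-matched random set. Names and the size rule are hypothetical; only the quoted $\tau_{high} = 0.75$ comes from the paper.

```python
import random

def masked_indices(attributions, tau_high=0.75, strategy="attribution", seed=0):
    # Number of tokens to penalize: the top (1 - tau_high) fraction,
    # at least one token.
    k = max(1, round(len(attributions) * (1 - tau_high)))
    if strategy == "attribution":
        # Attribution-guided: penalize the k highest-attribution tokens.
        order = sorted(range(len(attributions)),
                       key=lambda i: attributions[i], reverse=True)
        return set(order[:k])
    # Random control: a size-matched set that ignores attributions entirely.
    return set(random.Random(seed).sample(range(len(attributions)), k))
```

With attributions [0.1, 0.9, 0.3, 0.7], the guided strategy selects index 1, while the control draws one index at random; the ablation's claim is that only the former consistently helps hard transfers.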
Where Pith is reading between the lines
- The token-level attributions produced by AGM could serve as a diagnostic tool to audit models for unintended domain biases in related classification tasks.
- Because the masking operates at training time, the same attribution signal might be reused to adapt models incrementally as new domains appear.
- If the same gradient-attribution patterns recur across tasks, the approach could generalize beyond sentiment to other sequence-labeling problems with domain shift.
Load-bearing premise
Dynamically penalizing tokens with high gradient attribution during fine-tuning will reliably suppress domain-specific spurious features and improve out-of-domain performance without target-domain supervision or human annotation.
What would settle it
If AGM neither lowers attribution on domain-specific tokens such as @mentions and hashtags nor raises out-of-domain accuracy above random-masking controls on Sentiment140, then the claim that attribution guidance is the effective mechanism is falsified.
original abstract
While pre-trained Transformer models achieve high accuracy on in-domain sentiment classification, they frequently experience severe performance degradation when transferring to out-of-domain data. We hypothesize that this generalization gap is driven by reliance on domain-specific spurious tokens. After demonstrating that post-hoc token-level attribution drift fails to predict this gap, we propose Attribution-Guided Masking (AGM), a training-time intervention that dynamically detects and penalizes highly attributed spurious tokens during fine-tuning. AGM's core component is a gradient-based attribution masking loss ($\mathcal{L}_{mask}$), which can optionally be combined with a counterfactual contrastive loss to enforce domain-invariant representations, all without requiring target-domain labels or human annotation. Evaluated in a strict zero-shot transfer setting across four diverse domains with eight random seeds, AGM achieves competitive generalization compared to five strong baselines on the hardest transfer (Sentiment140): $\Delta$ = 0.244 versus DANN (0.264), DRO (0.248), Fish (0.247), and IRM (0.238), while uniquely providing token-level interpretability into which features drive the generalization gap. Our qualitative analysis confirms that AGM suppresses attribution on domain-specific tokens such as @mentions, hashtags, and slang, shifting reliance toward domain-invariant sentiment markers. Our ablation study further confirms that attribution-guided masking is the critical component: removing it or replacing it with random token selection consistently degrades performance on difficult transfers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Attribution-Guided Masking (AGM) to address performance degradation in cross-domain sentiment classification with pre-trained Transformers. It hypothesizes that generalization gaps arise from reliance on domain-specific spurious tokens. After showing that post-hoc token-level attribution drift does not predict the gap, the authors introduce a gradient-based attribution masking loss L_mask during fine-tuning (optionally combined with a counterfactual contrastive loss) that penalizes high-attribution tokens without target-domain labels or annotations. In zero-shot transfer experiments across four domains with eight random seeds, AGM achieves competitive results on the hardest transfer (Sentiment140) with Δ=0.244, outperforming DANN, DRO, and Fish while trailing IRM slightly, and provides token-level interpretability by suppressing attributions on mentions, hashtags, and slang in favor of invariant markers. An ablation confirms that attribution-guided masking (vs. random masking) is critical for the gains.
Significance. If the central results hold, AGM provides a label-free training intervention that improves OOD generalization competitively with domain-adaptation baselines while adding unique token-level interpretability into the sources of the generalization gap. The ablation isolating the attribution component over random masking is a strength, as is the explicit demonstration that post-hoc drift is not predictive.
major comments (3)
- [Abstract] Abstract: The paper states that post-hoc token-level attribution drift fails to predict the generalization gap, yet AGM relies on gradient attributions computed during source-only fine-tuning to identify and penalize spurious tokens via L_mask. This creates an unresolved tension: if static attributions are not predictive of the gap, it is unclear why dynamic penalization during training will selectively suppress domain-specific features rather than useful ones or act as generic regularization. A concrete justification or additional analysis is needed to establish that the attribution signal is causally responsible for the observed gains.
- [Results] Results and ablation (as described in abstract): The reported Δ=0.244 on Sentiment140 is competitive but trails IRM (0.238), and the claim that AGM 'achieves competitive generalization' requires error bars, statistical tests, or exact data splits across the eight seeds to confirm the differences are reliable rather than within noise. Without these, the support for superiority over the listed baselines remains limited.
- [Method] Method (L_mask definition): The core assumption that high gradient attribution tokens during fine-tuning are reliably spurious (and that penalizing them improves OOD performance) lacks a quantitative link to gap closure. The qualitative suppression of @mentions/hashtags/slang is plausible but does not demonstrate that the masking selectively targets features driving the measured Δ rather than correlating with it; an analysis correlating attribution changes with per-token contribution to the gap would strengthen this.
minor comments (2)
- [Abstract] The abstract and text would benefit from explicit definition of Δ (e.g., is it error rate, accuracy drop, or another metric) and clarification on whether lower values indicate better generalization.
- [Method] Notation for the optional counterfactual contrastive loss should be introduced with an equation number for clarity when discussing its combination with L_mask.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important points about the motivation, empirical rigor, and causal claims in our work. We address each major comment below and will incorporate clarifications and additional analyses in the revised manuscript.
point-by-point responses
- Referee: [Abstract] The paper states that post-hoc token-level attribution drift fails to predict the generalization gap, yet AGM relies on gradient attributions computed during source-only fine-tuning to identify and penalize spurious tokens via L_mask. This creates an unresolved tension: if static attributions are not predictive of the gap, it is unclear why dynamic penalization during training will selectively suppress domain-specific features rather than useful ones or act as generic regularization.
Authors: We agree there is an important distinction to clarify. The post-hoc analysis examines attributions on a fully converged model that has already overfit to spurious tokens, where drift no longer correlates with the gap because the damage is done. In contrast, AGM computes attributions dynamically on the current model state during fine-tuning and intervenes before the model locks in reliance on those tokens. This early guidance allows the optimizer to favor invariant features from the outset, which is why the ablation shows attribution-guided masking outperforms random masking. We will revise the abstract and add a dedicated paragraph in Section 3 explaining this temporal difference, along with a plot of attribution evolution over training steps to illustrate the effect. revision: partial
- Referee: [Results] The reported Δ=0.244 on Sentiment140 is competitive but trails IRM (0.238), and the claim that AGM 'achieves competitive generalization' requires error bars, statistical tests, or exact data splits across the eight seeds to confirm the differences are reliable rather than within noise.
Authors: We accept this point. Although AGM outperforms DANN, DRO, and Fish and is within 0.006 of IRM, we did not report standard deviations or significance tests. In the revision we will add mean ± std across the eight seeds for all methods, specify the exact train/validation/test splits used (standard splits from the respective datasets), and include paired t-tests or Wilcoxon tests with p-values to assess whether observed differences are statistically reliable. This will allow readers to judge competitiveness more precisely. revision: yes
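The per-seed comparison the authors promise can be sketched with a paired t statistic over the eight seeds. The scores below are made-up placeholders, not the paper's results, and `paired_t` is a hypothetical helper.

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    # Paired t statistic: mean of per-seed differences divided by the
    # standard error of those differences (sample stdev / sqrt(n)).
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-seed OOD accuracies for AGM and one baseline (8 seeds).
agm      = [0.71, 0.70, 0.72, 0.71, 0.70, 0.72, 0.71, 0.71]
baseline = [0.69, 0.69, 0.69, 0.69, 0.69, 0.69, 0.69, 0.69]
t = paired_t(agm, baseline)
```

The resulting t value would then be compared against the t distribution with seven degrees of freedom (or replaced by a Wilcoxon signed-rank test if normality of the differences is in doubt, as the rebuttal suggests).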
- Referee: [Method] The core assumption that high gradient attribution tokens during fine-tuning are reliably spurious (and that penalizing them improves OOD performance) lacks a quantitative link to gap closure. The qualitative suppression of @mentions/hashtags/slang is plausible but does not demonstrate that the masking selectively targets features driving the measured Δ rather than correlating with it; an analysis correlating attribution changes with per-token contribution to the gap would strengthen this.
Authors: We acknowledge that the current evidence is primarily qualitative and ablation-based. The ablation isolating attribution-guided masking from random masking provides indirect support that the signal is not generic regularization, and the qualitative examples show suppression of known spurious tokens. However, a direct per-token correlation between attribution magnitude and each token's causal contribution to the domain gap would require new experiments (e.g., token-level interventions or leave-one-token-out gap measurements) that are beyond the scope of the present study. We will add a limitations paragraph discussing this gap and propose such an analysis as future work, while retaining the existing ablation and qualitative results as supporting evidence. revision: partial
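The leave-one-token-out gap measurement the authors defer to future work could look like the following sketch. The toy scoring function stands in for an out-of-domain evaluation and is purely illustrative.

```python
def leave_one_out_contributions(tokens, score_fn):
    # Contribution of token i: how much the score drops when token i
    # is removed from the input.
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

# Toy stand-in for an OOD score: sentiment words help, hashtags hurt.
def toy_score(tokens):
    positive = {"great", "love"}
    return (sum(1 for t in tokens if t in positive)
            - sum(1 for t in tokens if t.startswith("#")))

contribs = leave_one_out_contributions(["great", "#fail", "movie"], toy_score)
```

A token with a negative contribution (here the hashtag) is one whose presence lowers the score: exactly the kind of spurious feature AGM is meant to suppress, and correlating these contributions with attribution changes is the analysis the referee requests.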
Circularity Check
No circularity: empirical method with external evaluation
full rationale
The paper introduces AGM as an algorithmic fine-tuning procedure that adds a gradient-based L_mask term to penalize high-attribution tokens, optionally combined with a contrastive loss, then measures zero-shot transfer accuracy on held-out target domains against five independent baselines (DANN, DRO, Fish, IRM, etc.). All reported deltas, ablations, and qualitative token analyses are computed from external test sets and random seeds; no equation, performance number, or central claim reduces by construction to a quantity fitted from the same run or to a self-citation whose content is presupposed. The post-hoc attribution observation is presented as motivation, not as a derived prediction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Gradient-based attributions computed on source-domain data can identify tokens that will cause generalization failure on unseen target domains.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "AGM's core component is a gradient-based attribution masking loss ($\mathcal{L}_{mask}$)... $\mathcal{L}_{mask} = \frac{1}{|M|} \sum (a_i)^2$ ... $\lambda_1 = \lambda_2 = 0.1$, $\tau_{high} = 0.75$"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2019. Invariant risk minimization. arXiv preprint arXiv:1907.02893.
- [2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
- [3] Enelpol. 2024. Booking.com reviews dataset. https://huggingface.co/datasets/enelpol/booking_com_reviews
- [4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1--35.
- [5] Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12).
- [6] Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 107--112.
- [7]
- [8] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- [9] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL), pages 142--150.
- [10] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, and 2 others. 2019. PyTorch: An imperative style, high-performance de...
- [11] Elan Rosenfeld, Pradeep Ravikumar, and Andrej Risteski. 2021. The risks of invariant risk minimization. In International Conference on Learning Representations (ICLR).
- [12] Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI).
- [13] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. 2020. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR).
- [14] Yuge Shi, Jeffrey Seely, Philip H. S. Torr, N. Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. 2022. Gradient matching for domain generalization. In International Conference on Learning Representations (ICLR).
- [15] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML).
- [16] Lifu Tu, Garima Lalwani, Spandana Gella, and He He. 2020. An empirical study on robustness to spurious correlations using pre-trained language models. Transactions of the Association for Computational Linguistics, 8:621--633.
- [17] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. Transformers: State-of-the-art natural language processing. ...
- [18] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28.