pith. sign in

arxiv: 2606.22858 · v1 · pith:PV3GCTKXnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

The Unseen Hand: Manipulating Model Fairness and SHAP with Targeted Identity Re-Association Attacks

Pith reviewed 2026-06-26 09:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords adversarial attacksalgorithmic fairnessSHAP explanationsmodel manipulationdata perturbationexplainable AImachine learning security
0
0 comments X

The pith

Targeted identity re-association attacks can drive fairness metrics to ideal values while reducing SHAP attribution for protected features to zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Targeted Identity Re-Association (TIRA) attacks that use two probabilistic algorithms to alter data associations. Probabilistic Micro-Shuffling performs localized adjacent swaps, and Probabilistic Rank-Shift Micro-Perturbation introduces small randomized rank shifts. These methods manipulate model outputs iteratively without access to internals or feature representations. They improve fairness metrics and eliminate residual SHAP attributions for protected features, addressing the detectable artifacts left by earlier attacks. A sympathetic reader would care because current fairness audits and explainability tools could be systematically misled by stealthy input changes.

Core claim

TIRA attacks, through the algorithms Probabilistic Micro-Shuffling (PMiS) and Probabilistic Rank-Shift Micro-Perturbation (PRSMP), iteratively and probabilistically manipulate model outputs to push fairness metrics toward ideal values and confound SHAP-based explanations, leaving effectively zero residual attribution for protected features, without requiring access to the model's internals or feature representations and without leaving detectable artifacts.

What carries the argument

Targeted Identity Re-Association (TIRA) attacks, which apply localized adjacent swaps or small randomized rank shifts to re-associate identities in the input data and thereby alter model outputs and attributions.

Load-bearing premise

The attacks can be performed iteratively and probabilistically to manipulate outputs without requiring access to the model's internals or feature representations while leaving no detectable artifacts.

What would settle it

Apply a TIRA attack to a dataset with known protected features, retrain or query the model, compute SHAP values, and check whether the attribution scores for the protected features remain at or near zero.

Figures

Figures reproduced from arXiv: 2606.22858 by Muhammad U. S. Khan, Sannaan Khan.

Figure 1
Figure 1. Figure 1: SHAP Values of the Protected Feature post-Attack for Bangladeshi Diabetes Dataset (LR [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Tradeoff Curves between Disparate Impact (x-axis) and Demographic Parity (y-axis) across [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: SHAP Values of the Protected Feature post-Attack for Bangladeshi Diabetes Dataset (NN [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

As machine learning models grow more influential and opaque, algorithmic fairness and explainability are critical for ensuring accountability. However, we demonstrate that these auditing mechanisms are themselves vulnerable to subtle manipulation, camouflaging the influence of protected features. While prior work on data-agnostic attacks has exposed this vulnerability, they leave behind detectable artifacts that compromise their stealth. We introduce Targeted Identity Re-Association (TIRA) attacks, a novel family of attacks that iteratively and probabilistically manipulate a model's outputs without requiring access to the model's internals or feature representations. We formalize two algorithms: Probabilistic Micro-Shuffling (PMiS), which applies localized adjacent swaps, and Probabilistic Rank-Shift Micro-Perturbation (PRSMP), which introduces small, randomized rank shifts. We empirically demonstrate that TIRA attacks are highly effective at pushing fairness metrics towards ideal values. Crucially, TIRA attacks successfully confound SHAP-based explanations, leaving effectively zero residual attribution for protected features, a major improvement over prior work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Targeted Identity Re-Association (TIRA) attacks, formalized as Probabilistic Micro-Shuffling (PMiS) and Probabilistic Rank-Shift Micro-Perturbation (PRSMP). These are claimed to iteratively and probabilistically alter model outputs to drive fairness metrics to ideal values and reduce protected-feature attributions in SHAP explanations to effectively zero, without requiring access to model internals or feature representations, and without the detectable artifacts left by prior data-agnostic attacks.

Significance. If the empirical claims were substantiated, the work would be significant for exposing vulnerabilities in both fairness auditing and post-hoc explainability methods, with potential implications for regulatory and deployment practices. The absence of any supporting experiments, however, prevents any assessment of whether those implications are warranted.

major comments (1)
  1. [Abstract] Abstract: the central claim that TIRA attacks are 'highly effective at pushing fairness metrics towards ideal values' and 'successfully confound SHAP-based explanations, leaving effectively zero residual attribution for protected features' is presented as an empirical result, yet the manuscript supplies no datasets, models, quantitative metrics, tables, figures, or experimental protocol. This absence renders the primary contribution unevaluable and is load-bearing for the paper's thesis.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and for highlighting this critical issue. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that TIRA attacks are 'highly effective at pushing fairness metrics towards ideal values' and 'successfully confound SHAP-based explanations, leaving effectively zero residual attribution for protected features' is presented as an empirical result, yet the manuscript supplies no datasets, models, quantitative metrics, tables, figures, or experimental protocol. This absence renders the primary contribution unevaluable and is load-bearing for the paper's thesis.

    Authors: We agree that the current manuscript does not contain the required experimental section, datasets, models, metrics, tables, figures, or protocol. The abstract's empirical claims are therefore unsupported in the submitted version. This constitutes a substantive omission. In the revised manuscript we will add a complete experimental evaluation section that specifies the datasets, models, fairness and SHAP metrics (with before/after values), quantitative results, tables, figures, and a reproducible experimental protocol. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript introduces TIRA attacks (PMiS and PRSMP) as empirical methods and reports their observed effects on fairness metrics and SHAP attributions. The provided text contains no equations, derivations, fitted parameters, or self-citations that could reduce any claim to its own inputs by construction. The central claims are presented as experimental outcomes rather than mathematical predictions or uniqueness theorems, so no load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the high-level description of the attack procedures.

pith-pipeline@v0.9.1-grok · 5706 in / 968 out tokens · 30723 ms · 2026-06-26T09:05:58.846796+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 13 canonical work pages · 5 internal anchors

  1. [1]

    A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil- López, D. Molina, R. Benjamins, R. Chatila, and F. Herrera. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI.Information Fusion, 2020. URLhttps://arxiv.org/abs/1910.10045

  2. [2]

    R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, S. Nagar, K. N. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh, K. R. Varshney, and Y . Zhang. AI fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias.IBM Journal of Research and Development, 2019. UR...

  3. [3]

    Das and P

    A. Das and P. Rad. Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv preprint, 2020. URLhttps://arxiv.org/abs/2006.11371

  4. [4]

    Dimanov, U

    B. Dimanov, U. Bhatt, M. Jamnik, and A. Weller. You shouldn’t trust me: Learning models which conceal unfairness from multiple explanation methods. InECAI, 2020. URL https://ebooks.iospress.nl/ pdf/doi/10.3233/FAIA200380

  5. [5]

    S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. InProceedings of the Conference on Fairness, Accountability, and Transparency, 2019. URLhttps://arxiv.org/abs/1802. 04422

  6. [6]

    H. Hofmann. Statlog (german credit data). UCI Machine Learning Repository, 1994. DOI: https: //doi.org/10.24432/C5NC77

  7. [7]

    Islam, R

    M. Islam, R. Ferdousi, S. Rahman, and H. Y . Bushra. Likelihood prediction of diabetes at early stage using data mining techniques. InComputer Vision and Machine Intelligence in Medical Image Analysis: International Symposium, ISCMM 2019, 2020. URL https://link.springer.com/chapter/10. 1007/978-981-15-2428-2_10

  8. [8]

    Laberge, U

    G. Laberge, U. Aïvodji, S. Hara, M. Marchand, and F. Khomh. Fool SHAP with stealthily biased sampling. InICLR, 2023. URLhttps://arxiv.org/abs/2205.15419

  9. [9]

    A Unified Approach to Interpreting Model Predictions

    S. Lundberg and S. I. Lee. A unified approach to interpreting model predictions. InNeurIPS, 2017. URL https://arxiv.org/abs/1705.07874

  10. [10]

    "Why Should I Trust You?": Explaining the Predictions of Any Classifier

    M. T. Ribeiro, S. Singh, and C. Guestrin. "why should i trust you?": Explaining the predictions of any classifier. InKDD, 2016. URLhttps://arxiv.org/abs/1602.04938

  11. [11]

    C. Rudin. Stop explaining black box machine learning models for high-stakes decisions and use inter- pretable models instead.Nature Machine Intelligence, 2019. URL https://pmc.ncbi.nlm.nih.gov/ articles/PMC9122117/pdf/nihms-1058031.pdf

  12. [12]

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization.IJCV, 2019. URL https://arxiv. org/abs/1610.02391

  13. [13]

    Slack, S

    D. Slack, S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. InAAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES), 2020. URLhttps://arxiv.org/abs/1911.02508

  14. [14]

    A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual & Group Unfairness via Inequality Indices

    T. Speicher, H. Heidari, N. Grgic-Hlaca, K. P. Gummadi, A. Singla, A. Weller, and M. B. Zafar. A unified approach to quantifying algorithmic unfairness: Measuring individual & group unfairness via inequality indices. InACM SIGKDD International Conference on Knowledge Discovery and Data Mining Proceedings, 2018. URLhttps://arxiv.org/abs/1807.00787

  15. [15]

    Axiomatic Attribution for Deep Networks

    M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. InICML, 2017. URL https://arxiv.org/abs/1703.01365

  16. [16]

    Yuan and A

    J. Yuan and A. Dasgupta. Fooling SHAP with output shuffling attacks. InAAAI, 2024. URL https: //arxiv.org/abs/2408.06509. 5 A Related Work Explainability aims to render opaque deep learning models understandable and transparent [ 11]. Rooted in co-operative game theory, SHAP stands as a cornerstone of post-hoc explainability. SHAP offers a theoretically g...