arxiv: 2604.20027 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers

Pith reviewed 2026-05-10 02:02 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: vision transformers, human saliency, attention alignment, model interpretability, fine-tuning, self-attention, cognitive biases, saliency metrics

The pith

Fine-tuning vision transformers on human saliency maps induces human-like attention biases at no cost to accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether vision transformers can be aligned with human attentional patterns by fine-tuning their self-attention weights on maps of where people look in images. A shuffled control tests whether the gains come from meaningful semantic signals rather than generic supervision. The tuned models improve on five saliency metrics and develop three human-like biases: a reversal from large-object to small-object preference, a stronger animacy focus, and reduced attention entropy. According to Bayesian analysis, the same changes leave classification performance unchanged on ImageNet, ImageNet-C, and ObjectNet, whereas the identical procedure degrades both alignment and accuracy in a ResNet-50.
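Reduced attention entropy is one of the three biases listed above. As a rough illustration of what that quantity measures, here is a minimal Python sketch. It assumes the attention map is a non-negative 14x14 grid (the patch grid of ViT-B/16 at 224x224 input) and that entropy means Shannon entropy of the normalized map. The function name `attention_entropy` is hypothetical, and the paper may aggregate heads and layers differently.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of a non-negative attention map.

    Assumption: the map is normalized to sum to 1 before the entropy is taken.
    """
    p = np.asarray(attn, dtype=float).ravel()
    p = p / p.sum()
    return float(-(p * np.log(p + eps)).sum())

# Two extreme toy maps on a 14x14 patch grid (illustrative, not paper data).
uniform = np.ones((14, 14))   # attention spread evenly over all patches
peaked = np.zeros((14, 14))   # all attention on a single patch
peaked[7, 7] = 1.0

print("uniform:", attention_entropy(uniform))
print("peaked: ", attention_entropy(peaked))
```

A diffuse map sits near the maximum of log(196) nats, while a map focused on one patch sits near zero, so a drop in entropy after tuning would indicate more concentrated attention.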

Core claim

Fine-tuning the self-attention weights of ViT-B/16 on human saliency fixation maps, compared against a shuffled control, produces significantly higher alignment across five saliency metrics and induces three hallmark human biases: reversal of the baseline large-object preference toward small objects, amplified animacy preference, and lowered attention entropy. Bayesian parity analysis supplies decisive to very-strong evidence that these changes leave classification accuracy intact on ImageNet, ImageNet-C, and ObjectNet. The same procedure applied to ResNet-50 instead reduces both alignment and accuracy, indicating that the modular self-attention mechanism uniquely permits dissociation of spa

What carries the argument

Fine-tuning of ViT self-attention weights on human saliency fixation maps, isolated via comparison to a shuffled control.

Load-bearing premise

The human saliency fixation maps capture semantically relevant signals that the shuffled control can separate from nonspecific supervision effects.

What would settle it

Retraining with the shuffled maps yields the same alignment gains and bias changes as the real maps, or classification accuracy drops on ImageNet, ImageNet-C, or ObjectNet after fine-tuning with the real maps.

Figures

Figures reproduced from arXiv: 2604.20027 by Ethan Knights.

Figure 1: Methods for ViT Fine-Tuning. (A) Model Variants and Training Frame
Figure 2: Model-Human Attentional Alignment. (A) Saliency Alignment.
Figure 3: Acquisition of Human-Like Cognitive Biases.
Figure 4: Benchmarks. (A) Classification Accuracy.
Figure 5: ResNet-50 Replication. (A) Methods for CNN Fine-Tuning.
Original abstract

For state-of-the-art image understanding, Vision Transformers (ViTs) have become the standard architecture but their processing diverges substantially from human attentional characteristics. We investigate whether this cognitive gap can be shrunk by fine-tuning the self-attention weights of Google's ViT-B/16 on human saliency fixation maps. To isolate the effects of semantically relevant signals from generic human supervision, the tuned model is compared against a shuffled control. Fine-tuning significantly improved alignment across five saliency metrics and induced three hallmark human-like biases: tuning reversed the baseline's anti-human large-object bias toward small-objects, amplified the animacy preference and diminished extreme attention entropy. Bayesian parity analysis provides decisive to very-strong evidence that this cognitive alignment comes at no cost to the model's original classification performance on in- (ImageNet), corrupted (ImageNet-C) and out-of-distribution (ObjectNet) benchmarks. An equivalent procedure applied to a ResNet-50 Convolutional Neural Network (CNN) instead degraded both alignment and accuracy, suggesting that the ViT's modular self-attention mechanism is uniquely suited for dissociating spatial priority from representational logic. These findings demonstrate that biologically grounded priors can be instilled as a free emergent property of human-aligned attention, to improve transformer interpretability.
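The abstract does not state the training objective used to pull attention toward the fixation maps. One common choice for comparing two spatial distributions is a KL divergence, sketched below in Python purely as an assumption. The names `to_distribution` and `kl_alignment_loss` are hypothetical, and the toy Gaussian "human" map is not data from the paper.

```python
import numpy as np

def to_distribution(x, eps=1e-12):
    """Flatten a non-negative map and normalize it to sum to 1."""
    x = np.clip(np.asarray(x, dtype=float).ravel(), 0.0, None) + eps
    return x / x.sum()

def kl_alignment_loss(human_map, model_attn):
    """KL(human || model): zero when the two distributions coincide."""
    p = to_distribution(human_map)
    q = to_distribution(model_attn)
    return float(np.sum(p * np.log(p / q)))

# Toy human fixation density: a Gaussian blob on a 14x14 patch grid.
ys, xs = np.mgrid[0:14, 0:14]
human = np.exp(-((ys - 4) ** 2 + (xs - 9) ** 2) / (2 * 2.0 ** 2))

loss_matched = kl_alignment_loss(human, human)               # model attends like the human map
loss_uniform = kl_alignment_loss(human, np.ones((14, 14)))   # model attends everywhere equally

print("matched:", loss_matched)
print("uniform:", loss_uniform)
```

In an actual fine-tuning run a loss of this kind would be minimized with respect to the self-attention weights only; that restriction is described in the abstract, but the loss form here is not.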

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that fine-tuning the self-attention layers of ViT-B/16 on human saliency fixation maps improves alignment with human attention patterns across five saliency metrics, induces three human-like biases (reversal of anti-human large-object bias toward small objects, amplified animacy preference, and reduced attention entropy), and that this alignment incurs no cost to classification accuracy. This is supported by comparison to a shuffled control (to isolate semantic signals) and Bayesian parity analysis showing decisive-to-very-strong evidence of equivalent performance on ImageNet, ImageNet-C, and ObjectNet. The same procedure applied to ResNet-50 degrades both alignment and accuracy, suggesting ViT self-attention is uniquely suited for this dissociation.

Significance. If the empirical results and controls hold, the work is significant for demonstrating that biologically grounded attention priors can be instilled in transformers as an emergent property that enhances interpretability without performance trade-offs on in-distribution, corrupted, and out-of-distribution benchmarks. The architectural contrast with CNNs and the use of a shuffled control provide a concrete way to separate generic supervision from semantically relevant human biases, with potential implications for more human-aligned vision models.

major comments (2)
  1. [Bayesian parity analysis (abstract and results)] The central 'at no cost' claim rests on Bayesian parity analysis providing decisive-to-very-strong evidence of equivalent accuracy across ImageNet, ImageNet-C, and ObjectNet. However, the abstract (and presumably the corresponding results section) reports neither the exact model specification, prior choices (e.g., normal vs. Cauchy on the performance difference), ROPE width, numerical Bayes factor values, nor any sensitivity/robustness checks. Equivalence Bayes factors are known to be sensitive to these choices; without them the strength of evidence cannot be verified and the claim does not fully follow from the reported data.
  2. [Methods (shuffled control description)] The shuffled control is presented as isolating semantically relevant signals from generic human supervision, yet the manuscript provides no quantitative comparison of how well the shuffled maps preserve low-level statistics (e.g., center bias, entropy) versus the original fixation maps. If the control fails to fully match these statistics, the attribution of bias induction specifically to semantic content is weakened.
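Major comment 1 turns on how the prior and ROPE shape an equivalence Bayes factor. The Python sketch below shows one simple way such an analysis could be set up: a zero-centred normal prior on the accuracy difference, a normal likelihood with a known standard error, and an interval-null Bayes factor computed as posterior odds over prior odds of the difference lying inside the ROPE. The function `rope_bayes_factor` is hypothetical, every number is illustrative, and nothing here is the paper's actual specification.

```python
import math

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def rope_bayes_factor(diff, se, prior_sd, rope):
    """Bayes factor for |true difference| < rope versus outside it.

    diff: observed accuracy difference (tuned minus baseline)
    se: standard error of that difference
    prior_sd: sd of the zero-centred normal prior (an assumption)
    rope: half-width of the region of practical equivalence (an assumption)
    """
    # Conjugate normal-normal update with known standard error.
    post_var = 1.0 / (1.0 / prior_sd ** 2 + 1.0 / se ** 2)
    post_mean = post_var * diff / se ** 2
    post_sd = math.sqrt(post_var)

    prior_in = normal_cdf(rope, 0.0, prior_sd) - normal_cdf(-rope, 0.0, prior_sd)
    post_in = normal_cdf(rope, post_mean, post_sd) - normal_cdf(-rope, post_mean, post_sd)
    post_in = min(max(post_in, 1e-15), 1.0 - 1e-15)  # guard against division by zero

    prior_odds = prior_in / (1.0 - prior_in)
    post_odds = post_in / (1.0 - post_in)
    return post_odds / prior_odds, post_in

# Illustrative values in percentage points (not taken from the paper).
bf_small, p_small = rope_bayes_factor(diff=0.1, se=0.3, prior_sd=2.0, rope=1.0)
bf_large, p_large = rope_bayes_factor(diff=-1.5, se=0.3, prior_sd=2.0, rope=1.0)

# Sensitivity of the Bayes factor to the prior width, for the same data.
sensitivity = {sd: rope_bayes_factor(0.1, 0.3, sd, 1.0)[0] for sd in (0.5, 2.0, 8.0)}

print("small difference:", bf_small, p_small)
print("large difference:", bf_large, p_large)
print("prior sensitivity:", sensitivity)
```

With the same data, the Bayes factor in this toy setup changes substantially as the prior widens, which is the sensitivity the referee asks the authors to report.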
minor comments (2)
  1. [Abstract] The abstract states that five saliency metrics were used but does not name them; explicitly listing the metrics (e.g., NSS, CC, etc.) would improve immediate readability.
  2. [Abstract] The claim that the procedure 'reversed the baseline's anti-human large-object bias' would benefit from a brief quantitative statement of the effect size or statistical test in the abstract or early results.
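Minor comment 1 mentions NSS and CC as examples of saliency metrics. For readers unfamiliar with them, here is a minimal Python sketch of their usual definitions: NSS is the mean z-scored saliency at fixated locations, and CC is the Pearson correlation between two maps. Whether these two are among the paper's five metrics is not stated on this page, and the random toy maps below are not paper data.

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    s = np.asarray(saliency, dtype=float)
    z = (s - s.mean()) / s.std()
    return float(z[np.asarray(fixations, dtype=bool)].mean())

def cc(saliency, human_density):
    """Pearson linear correlation between two maps of the same shape."""
    a = np.asarray(saliency, dtype=float).ravel()
    b = np.asarray(human_density, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Toy example on a 14x14 grid.
rng = np.random.default_rng(0)
density = rng.random((14, 14))                  # stand-in model saliency map
fixations = density > np.quantile(density, 0.9) # binary map: top 10% of locations

print("NSS:", nss(density, fixations))
print("CC with itself:", cc(density, density))
print("CC with inverse:", cc(density, 1.0 - density))
```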

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each of the major comments in detail below.

Point-by-point responses
  1. Referee: [Bayesian parity analysis (abstract and results)] The central 'at no cost' claim rests on Bayesian parity analysis providing decisive-to-very-strong evidence of equivalent accuracy across ImageNet, ImageNet-C, and ObjectNet. However, the abstract (and presumably the corresponding results section) reports neither the exact model specification, prior choices (e.g., normal vs. Cauchy on the performance difference), ROPE width, numerical Bayes factor values, nor any sensitivity/robustness checks. Equivalence Bayes factors are known to be sensitive to these choices; without them the strength of evidence cannot be verified and the claim does not fully follow from the reported data.

    Authors: We acknowledge that the manuscript does not provide the full specification of the Bayesian parity analysis. We will revise the Methods and Results sections to include the exact model specification, prior choices (normal prior on the performance difference), ROPE width, numerical Bayes factor values, and sensitivity/robustness checks under alternative priors. This will allow verification of the evidence strength for the equivalence claim. revision: yes

  2. Referee: [Methods (shuffled control description)] The shuffled control is presented as isolating semantically relevant signals from generic human supervision, yet the manuscript provides no quantitative comparison of how well the shuffled maps preserve low-level statistics (e.g., center bias, entropy) versus the original fixation maps. If the control fails to fully match these statistics, the attribution of bias induction specifically to semantic content is weakened.

    Authors: We agree that quantifying the preservation of low-level statistics in the shuffled control would strengthen the claim that it isolates semantic content. We will add this comparison to the Methods section, reporting metrics such as center bias and entropy for the original versus shuffled maps to demonstrate that the procedure primarily disrupts semantic alignment while largely preserving spatial statistics. revision: yes
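The comparison promised in response 2 could look roughly like the Python sketch below. It assumes the shuffled control re-pairs whole fixation maps with different images, which this page does not confirm, and it uses two hypothetical summary statistics: Shannon entropy of the map and mass-weighted mean distance from the map centre as a proxy for center bias. The random maps are placeholders for real fixation data.

```python
import numpy as np

def map_entropy(m, eps=1e-12):
    """Shannon entropy (nats) of a map normalized to sum to 1."""
    p = np.asarray(m, dtype=float).ravel()
    p = p / p.sum()
    return float(-(p * np.log(p + eps)).sum())

def center_distance(m):
    """Mass-weighted mean distance from the map centre, in grid cells."""
    m = np.asarray(m, dtype=float)
    p = m / m.sum()
    h, w = p.shape
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.hypot(ys - (h - 1) / 2.0, xs - (w - 1) / 2.0)
    return float((p * d).sum())

def shuffle_across_images(maps, rng):
    """Re-pair maps with images by a cyclic shift, so no image keeps its own map."""
    n = len(maps)
    offset = int(rng.integers(1, n))  # random shift in 1..n-1
    return maps[(np.arange(n) + offset) % n]

rng = np.random.default_rng(0)
maps = rng.random((8, 14, 14))  # placeholder fixation maps, one per image
shuffled = shuffle_across_images(maps, rng)

orig_stats = sorted((map_entropy(m), center_distance(m)) for m in maps)
shuf_stats = sorted((map_entropy(m), center_distance(m)) for m in shuffled)
print("statistics preserved:", np.allclose(orig_stats, shuf_stats))
```

Under this assumed scheme the dataset-level distribution of both statistics is preserved by construction, since only the image-to-map pairing changes. Shuffling pixels within each map would instead keep entropy but destroy center bias, so the outcome of the comparison depends on which shuffle the paper actually uses.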

Circularity Check

0 steps flagged

Empirical fine-tuning and Bayesian evaluation of alignment vs. accuracy is self-contained

The paper describes an experimental procedure: fine-tuning ViT-B/16 self-attention on human saliency maps, comparison to a shuffled control baseline, measurement of alignment metrics, and direct accuracy evaluation on ImageNet/ImageNet-C/ObjectNet. Bayesian parity analysis is applied post-hoc to the observed performance differences. No equations, parameters, or claims reduce by construction to their own inputs; the shuffled control supplies an independent contrast not derived from the success metrics. No self-citation chains, ansatzes, or renamings of known results are load-bearing for the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities. The fine-tuning procedure itself likely involves standard hyperparameters (learning rate, epochs) that are not reported here; no new physical or mathematical entities are postulated.
