Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers

Ethan Knights

arxiv: 2604.20027 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-10 02:02 UTC · model grok-4.3

keywords vision transformers · human saliency · attention alignment · model interpretability · fine-tuning · self-attention · cognitive biases · saliency metrics

The pith

Fine-tuning vision transformers on human saliency maps induces human-like attention biases at no cost to accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether vision transformers can be aligned with human attentional patterns by fine-tuning their self-attention weights on maps of where people look at images. A shuffled control isolates whether the gains come from meaningful semantic signals rather than generic supervision. The tuned models improve on five saliency metrics and develop three human-like biases, including a reversal from large-object to small-object preference, stronger animacy focus, and reduced attention entropy. The same changes leave classification performance unchanged on ImageNet, ImageNet-C, and ObjectNet according to Bayesian analysis, whereas the identical procedure degrades both alignment and accuracy in a ResNet-50.

Core claim

Fine-tuning the self-attention weights of ViT-B/16 on human saliency fixation maps, compared against a shuffled control, produces significantly higher alignment across five saliency metrics and induces three hallmark human biases: reversal of the baseline large-object preference toward small objects, amplified animacy preference, and lowered attention entropy. Bayesian parity analysis supplies decisive to very-strong evidence that these changes leave classification accuracy intact on ImageNet, ImageNet-C, and ObjectNet. The same procedure applied to ResNet-50 instead reduces both alignment and accuracy, indicating that the modular self-attention mechanism uniquely permits dissociation of spa

What carries the argument

Fine-tuning of ViT self-attention weights on human saliency fixation maps, isolated via comparison to a shuffled control.

Load-bearing premise

The human saliency fixation maps capture semantically relevant signals that the shuffled control can separate from nonspecific supervision effects.
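The summary does not say how the shuffled control is built. One plausible reading, sketched below purely as an assumption, is that each training image is paired with a fixation map recorded for a different image, so the supervision keeps human-like map statistics but loses its link to image content. The function name `shuffled_pairing` and the image identifiers are hypothetical.

```python
import random


def shuffled_pairing(n_items, seed=0):
    """Return a permutation of range(n_items) with no fixed points.

    Entry i names the saliency map assigned to image i, so every image
    is supervised with a map that was recorded for a different image.
    This is an assumed control design, not the paper's stated procedure.
    """
    if n_items < 2:
        raise ValueError("need at least two items to mismatch every pair")
    rng = random.Random(seed)
    perm = list(range(n_items))
    # Rejection sampling: reshuffle until no image keeps its own map.
    while any(i == p for i, p in enumerate(perm)):
        rng.shuffle(perm)
    return perm


# Hypothetical image identifiers, for illustration only.
image_ids = ["img_a", "img_b", "img_c", "img_d", "img_e"]
perm = shuffled_pairing(len(image_ids), seed=0)
control_pairs = [(image_ids[i], image_ids[p]) for i, p in enumerate(perm)]
for image_id, map_source in control_pairs:
    print(f"{image_id} is trained against the map recorded for {map_source}")
```

Under this design the set of maps is unchanged, so population-level statistics such as mean entropy are preserved by construction. A within-map spatial shuffle would behave differently, which is why the exact procedure matters for the premise above.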

What would settle it

Retraining with the shuffled maps yields the same alignment gains and bias changes as the real maps, or classification accuracy drops on ImageNet, ImageNet-C, or ObjectNet after fine-tuning with the real maps.

Figures

Figures reproduced from arXiv: 2604.20027 by Ethan Knights.

**Figure 2.** Model-Human Attentional Alignment. (A) Saliency Alignment.

**Figure 3.** Acquisition of Human-Like Cognitive Biases.

**Figure 4.** Benchmarks. (A) Classification Accuracy.

**Figure 5.** ResNet-50 Replication. (A) Methods for CNN Fine-Tuning.

Original abstract

For state-of-the-art image understanding, Vision Transformers (ViTs) have become the standard architecture but their processing diverges substantially from human attentional characteristics. We investigate whether this cognitive gap can be shrunk by fine-tuning the self-attention weights of Google's ViT-B/16 on human saliency fixation maps. To isolate the effects of semantically relevant signals from generic human supervision, the tuned model is compared against a shuffled control. Fine-tuning significantly improved alignment across five saliency metrics and induced three hallmark human-like biases: tuning reversed the baseline's anti-human large-object bias toward small-objects, amplified the animacy preference and diminished extreme attention entropy. Bayesian parity analysis provides decisive to very-strong evidence that this cognitive alignment comes at no cost to the model's original classification performance on in- (ImageNet), corrupted (ImageNet-C) and out-of-distribution (ObjectNet) benchmarks. An equivalent procedure applied to a ResNet-50 Convolutional Neural Network (CNN) instead degraded both alignment and accuracy, suggesting that the ViT's modular self-attention mechanism is uniquely suited for dissociating spatial priority from representational logic. These findings demonstrate that biologically grounded priors can be instilled as a free emergent property of human-aligned attention, to improve transformer interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Fine-tuning ViT self-attention on saliency maps induces human-like biases at no accuracy cost on three benchmarks, but the decisive Bayesian parity claim rests on unreported prior and ROPE details.

The main takeaway is that fine-tuning only the self-attention layers of ViT-B/16 on human fixation maps improves alignment on five saliency metrics, flips the model's large-object bias toward small objects, boosts animacy preference, and lowers attention entropy, all while accuracy stays statistically unchanged on ImageNet, ImageNet-C, and ObjectNet. The shuffled saliency control isolates the effect from generic supervision, and the same procedure on ResNet-50 hurts both alignment and accuracy instead. This highlights how the modular attention in transformers lets you adjust spatial priority separately from the rest of the model.

Referee Report

2 major / 2 minor

Summary. The paper claims that fine-tuning the self-attention layers of ViT-B/16 on human saliency fixation maps improves alignment with human attention patterns across five saliency metrics, induces three human-like biases (reversal of anti-human large-object bias toward small objects, amplified animacy preference, and reduced attention entropy), and that this alignment incurs no cost to classification accuracy. This is supported by comparison to a shuffled control (to isolate semantic signals) and Bayesian parity analysis showing decisive-to-very-strong evidence of equivalent performance on ImageNet, ImageNet-C, and ObjectNet. The same procedure applied to ResNet-50 degrades both alignment and accuracy, suggesting ViT self-attention is uniquely suited for this dissociation.

Significance. If the empirical results and controls hold, the work is significant for demonstrating that biologically grounded attention priors can be instilled in transformers as an emergent property that enhances interpretability without performance trade-offs on in-distribution, corrupted, and out-of-distribution benchmarks. The architectural contrast with CNNs and the use of a shuffled control provide a concrete way to separate generic supervision from semantically relevant human biases, with potential implications for more human-aligned vision models.

major comments (2)

[Bayesian parity analysis (abstract and results)] The central 'at no cost' claim rests on Bayesian parity analysis providing decisive-to-very-strong evidence of equivalent accuracy across ImageNet, ImageNet-C, and ObjectNet. However, the abstract (and presumably the corresponding results section) reports neither the exact model specification, prior choices (e.g., normal vs. Cauchy on the performance difference), ROPE width, numerical Bayes factor values, nor any sensitivity/robustness checks. Equivalence Bayes factors are known to be sensitive to these choices; without them the strength of evidence cannot be verified and the claim does not fully follow from the reported data.
[Methods (shuffled control description)] The shuffled control is presented as isolating semantically relevant signals from generic human supervision, yet the manuscript provides no quantitative comparison of how well the shuffled maps preserve low-level statistics (e.g., center bias, entropy) versus the original fixation maps. If the control fails to fully match these statistics, the attribution of bias induction specifically to semantic content is weakened.
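The sensitivity concern in the first major comment can be made concrete with a small worked example. The sketch below is not the paper's analysis: it assumes a conjugate normal model (a zero-centred normal prior on the true accuracy difference and a normal likelihood for the observed difference), and every number in it is hypothetical. It reports the posterior mass inside the region of practical equivalence (ROPE) and an interval Bayes factor, defined here as posterior odds divided by prior odds of the difference lying inside the ROPE.

```python
from statistics import NormalDist


def rope_evidence(diff, se, prior_sd, rope):
    """Posterior ROPE mass and interval Bayes factor for an accuracy difference.

    Assumed model (not taken from the paper):
        true difference  ~ Normal(0, prior_sd)
        observed diff    ~ Normal(true difference, se)
    """
    prior = NormalDist(0.0, prior_sd)
    # Conjugate normal update: precisions add, and the posterior mean is
    # the precision-weighted average of the prior mean (0) and the data.
    post_precision = 1.0 / prior_sd**2 + 1.0 / se**2
    post_mean = (diff / se**2) / post_precision
    posterior = NormalDist(post_mean, post_precision**-0.5)

    def mass_in_rope(dist):
        return dist.cdf(rope) - dist.cdf(-rope)

    p_prior = mass_in_rope(prior)
    p_post = mass_in_rope(posterior)
    if not 0.0 < p_post < 1.0:
        raise ValueError("posterior ROPE mass saturated; odds are undefined")
    # Bayes factor for "inside ROPE" versus "outside ROPE".
    bayes_factor = (p_post / (1.0 - p_post)) / (p_prior / (1.0 - p_prior))
    return p_post, bayes_factor


# Hypothetical inputs: a 0.1 point accuracy gap, a 0.2 point standard
# error, and a ROPE of +/- 0.5 points, all expressed as proportions.
for prior_sd in (0.01, 0.05):
    p_post, bf = rope_evidence(diff=0.001, se=0.002, prior_sd=prior_sd, rope=0.005)
    print(f"prior_sd={prior_sd}: ROPE mass={p_post:.3f}, Bayes factor={bf:.1f}")
```

With the data held fixed, widening the prior leaves the posterior ROPE mass almost unchanged but raises the Bayes factor, because a wider prior places less mass inside the ROPE to begin with. That dependence is why the referee asks for the prior, the ROPE width, and the numerical Bayes factors to be reported together.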

minor comments (2)

[Abstract] The abstract states that five saliency metrics were used but does not name them; explicitly listing the metrics (e.g., NSS, CC, etc.) would improve immediate readability.
[Abstract] The claim that the procedure 'reversed the baseline's anti-human large-object bias' would benefit from a brief quantitative statement of the effect size or statistical test in the abstract or early results.
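For readers unfamiliar with the two metrics the first minor comment offers as examples, the sketch below gives their standard definitions in plain Python. Whether NSS and CC are among the paper's five metrics is not stated in the abstract. The helper names and the toy 2x2 maps are hypothetical. NSS (Normalized Scanpath Saliency) is the mean z-scored model saliency at human fixation locations, and CC is the Pearson correlation between two maps.

```python
import math


def _flatten(grid):
    return [value for row in grid for value in row]


def _mean_std(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def nss(saliency, fixations):
    """Mean z-scored saliency at fixated (row, col) positions."""
    mean, std = _mean_std(_flatten(saliency))
    if std == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    return sum((saliency[r][c] - mean) / std for r, c in fixations) / len(fixations)


def cc(map_a, map_b):
    """Pearson correlation coefficient between two equally sized maps."""
    a, b = _flatten(map_a), _flatten(map_b)
    mean_a, std_a = _mean_std(a)
    mean_b, std_b = _mean_std(b)
    if std_a == 0 or std_b == 0:
        raise ValueError("a constant map has no defined correlation")
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    return covariance / (std_a * std_b)


# Toy data: the model map peaks where the single human fixation falls.
model_map = [[0.1, 0.2], [0.3, 0.9]]
human_map = [[0.0, 0.0], [0.0, 1.0]]
fixations = [(1, 1)]

print("NSS:", nss(model_map, fixations))
print("CC:", cc(model_map, human_map))
```

A positive NSS means the model assigns above-average saliency to fixated locations, and a CC near 1 means the two maps rise and fall together across the image.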

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each of the major comments in detail below.

Referee: [Bayesian parity analysis (abstract and results)] The central 'at no cost' claim rests on Bayesian parity analysis providing decisive-to-very-strong evidence of equivalent accuracy across ImageNet, ImageNet-C, and ObjectNet. However, the abstract (and presumably the corresponding results section) reports neither the exact model specification, prior choices (e.g., normal vs. Cauchy on the performance difference), ROPE width, numerical Bayes factor values, nor any sensitivity/robustness checks. Equivalence Bayes factors are known to be sensitive to these choices; without them the strength of evidence cannot be verified and the claim does not fully follow from the reported data.

Authors: We acknowledge that the manuscript does not provide the full specification of the Bayesian parity analysis. We will revise the Methods and Results sections to include the exact model specification, prior choices (normal prior on the performance difference), ROPE width, numerical Bayes factor values, and sensitivity/robustness checks under alternative priors. This will allow verification of the evidence strength for the equivalence claim. revision: yes

Referee: [Methods (shuffled control description)] The shuffled control is presented as isolating semantically relevant signals from generic human supervision, yet the manuscript provides no quantitative comparison of how well the shuffled maps preserve low-level statistics (e.g., center bias, entropy) versus the original fixation maps. If the control fails to fully match these statistics, the attribution of bias induction specifically to semantic content is weakened.

Authors: We agree that quantifying the preservation of low-level statistics in the shuffled control would strengthen the claim that it isolates semantic content. We will add this comparison to the Methods section, reporting metrics such as center bias and entropy for the original versus shuffled maps to demonstrate that the procedure primarily disrupts semantic alignment while largely preserving spatial statistics. revision: yes
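The comparison promised in the second response could take roughly the following form. This is a sketch under stated assumptions: the two statistics are defined here as the Shannon entropy of the normalised map and the mass-weighted mean distance from the image centre, which may differ from the definitions the authors adopt. The function names and the 3x3 toy maps are hypothetical.

```python
import math


def normalise(saliency):
    """Scale a non-negative map so that its values sum to 1."""
    total = sum(sum(row) for row in saliency)
    if total <= 0:
        raise ValueError("map must contain positive mass")
    return [[value / total for value in row] for row in saliency]


def map_entropy(saliency):
    """Shannon entropy in bits of the normalised map (0 for a single peak)."""
    probs = normalise(saliency)
    return -sum(p * math.log2(p) for row in probs for p in row if p > 0)


def center_distance(saliency):
    """Mass-weighted mean distance, in pixels, from the image centre.

    Smaller values indicate a stronger centre bias.
    """
    probs = normalise(saliency)
    height, width = len(probs), len(probs[0])
    cy, cx = (height - 1) / 2, (width - 1) / 2
    return sum(
        p * math.hypot(r - cy, c - cx)
        for r, row in enumerate(probs)
        for c, p in enumerate(row)
    )


# Toy maps: one fixated at the centre, one at a corner, one uniform.
peaked = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
corner = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
uniform = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

for name, grid in (("peaked", peaked), ("corner", corner), ("uniform", uniform)):
    print(name, round(map_entropy(grid), 3), round(center_distance(grid), 3))
```

Reporting the distribution of both statistics over the original and the shuffled map sets would show directly whether the control differs from the real maps only in its pairing with image content.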

Circularity Check

0 steps flagged

Empirical fine-tuning and Bayesian evaluation of alignment vs. accuracy is self-contained

The paper describes an experimental procedure: fine-tuning ViT-B/16 self-attention on human saliency maps, comparison to a shuffled control baseline, measurement of alignment metrics, and direct accuracy evaluation on ImageNet/ImageNet-C/ObjectNet. Bayesian parity analysis is applied post-hoc to the observed performance differences. No equations, parameters, or claims reduce by construction to their own inputs; the shuffled control supplies an independent contrast not derived from the success metrics. No self-citation chains, ansatzes, or renamings of known results are load-bearing for the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities. The fine-tuning procedure itself likely involves standard hyperparameters (learning rate, epochs) that are not reported here; no new physical or mathematical entities are postulated.

pith-pipeline@v0.9.0 · 5514 in / 1310 out tokens · 33142 ms · 2026-05-10T02:02:43.375869+00:00 · methodology
