Inducing Spatial Locality in Vision Transformers through the Training Protocol

Asael Fabian Mart\'inez; Eduardo Santiago Toledo

arxiv: 2605.16390 · v1 · pith:UOKAYZDLnew · submitted 2026-05-11 · 💻 cs.CV · cs.LG· stat.ML

Inducing Spatial Locality in Vision Transformers through the Training Protocol

Eduardo Santiago Toledo , Asael Fabian Mart\'inez This is my paper

Pith reviewed 2026-05-20 21:50 UTC · model grok-4.3

classification 💻 cs.CV cs.LGstat.ML

keywords vision transformerspatial localitycutmixtraining protocolmean attention distancedata augmentationattention headscifar

0 comments

The pith

CutMix augmentation during training induces spatial locality in early layers of Vision Transformers trained from scratch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether the training protocol alone can make Vision Transformers develop local attention in their first layers without any pretraining on large data. The authors hold the model architecture fixed and compare a basic training setup against one that adds AutoAugment, CutMix, and label smoothing, then track attention behavior with Mean Attention Distance and entropy on three small image datasets. The Modern protocol consistently produces attention that stays closer to local image regions, and an ablation isolates CutMix as the component responsible for the shift. Readers may care because this points to a lightweight way to encourage locality through data handling rather than model redesign.

Core claim

Keeping architecture and optimization fixed, the Modern training protocol produces more local and concentrated attention in early layers compared to Baseline. On CIFAR-100, minimum MAD drops from 0.316 to 0.008. Ablation identifies CutMix as the determining factor, with all CutMix conditions showing MAD around 0.024 versus 0.210 without it. This suggests that the need to classify from partial image regions drives the emergence of local attention.

What carries the argument

CutMix data augmentation, which pastes random patches from one image onto another and mixes their labels, creating pressure to classify using local evidence.

If this is right

Across CIFAR-10, CIFAR-100, and Tiny-ImageNet the Modern protocol yields lower MAD and higher attention concentration in early layers.
AutoAugment and Label Smoothing produce no measurable independent effect on locality when added or removed alone.
All training conditions that include CutMix converge to the same low MAD value of 0.024.
The locality effect appears in the earliest layers where global attention would otherwise dominate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data augmentations like CutMix might serve as a practical substitute for architectural changes that hard-code local receptive fields.
The same training pressure could be tested on larger datasets to see whether the induced locality scales or saturates.
If local attention improves robustness to occlusions, CutMix-style protocols could be added to existing ViT training recipes with little extra cost.

Load-bearing premise

The observed MAD differences are caused specifically by CutMix rather than by unmeasured interactions with other training details or random choices, and that MAD faithfully measures functionally relevant spatial locality.

What would settle it

Re-running the ablation with fixed random seeds and identical code, then checking whether MAD still separates cleanly into the 0.024 versus 0.210 groups or whether downstream accuracy on occlusion-heavy tasks remains unchanged.

Figures

Figures reproduced from arXiv: 2605.16390 by Asael Fabian Mart\'inez, Eduardo Santiago Toledo.

**Figure 2.** Figure 2: Attentional behavior space (CIFAR-100). Each point represents an [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

**Figure 3.** Figure 3: Heatmaps for CIFAR-100 (Y-axis: Layers 1–8; X-axis: Heads 1–8). [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Minimum MAD per ablation condition, sorted in ascending order. [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 5.** Figure 5: MAD vs. entropy for four key ablation conditions (CIFAR-100). [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

read the original abstract

We investigate whether the training protocol can induce spatial locality in the early layers of a Vision Transformer (ViT) trained from scratch, without large-scale pretraining. Keeping the architecture and optimization procedure fixed, we compare a Baseline protocol with a Modern protocol (AutoAugment/ColorJitter, CutMix, and Label Smoothing) on CIFAR-10, CIFAR-100, and Tiny-ImageNet, characterizing each attention head via Mean Attention Distance (MAD) and normalized entropy. Across all three datasets, the Modern protocol produces more local and more concentrated attention in early layers; on CIFAR-100, the minimum MAD drops from 0.316 (Baseline) to 0.008 (Modern). To identify the source of this effect, we conduct an ablation study on CIFAR-100 by adding or removing each component individually. The results identify CutMix as the determining component within our experiments: all conditions with CutMix exhibit MAD 0.024, while all conditions without CutMix remain at MAD 0.210. AutoAugment and Label Smoothing show no independent effect on locality. Taken together, these findings suggest that the pressure to classify from partial image regions, induced by CutMix, can promote the emergence of local attention in Vision Transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CutMix appears to be the main reason early ViT layers develop local attention when training from scratch on CIFAR-scale data.

read the letter

The paper's central finding is that CutMix, rather than the other modern training elements, is what pushes Vision Transformers toward local attention in their early layers. They train the same ViT architecture from scratch on CIFAR-10, CIFAR-100, and Tiny-ImageNet using either a basic protocol or one with AutoAugment, ColorJitter, CutMix, and label smoothing. The modern version yields much lower mean attention distance in early layers, and on CIFAR-100 the minimum MAD drops from 0.316 to 0.008.

Referee Report

1 major / 2 minor

Summary. The paper claims that a Modern training protocol (AutoAugment/ColorJitter, CutMix, Label Smoothing) induces spatial locality in the early layers of Vision Transformers trained from scratch, without large-scale pretraining. Keeping architecture and optimizer fixed, experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show lower Mean Attention Distance (MAD) and more concentrated attention under the Modern protocol; an ablation on CIFAR-100 isolates CutMix as the key driver, with all CutMix conditions yielding MAD 0.024 versus 0.210 without it.

Significance. If the ablation result holds under tighter controls, the work provides evidence that a specific augmentation (CutMix) can promote locality biases in ViTs on small datasets, offering a practical route to reduce dependence on pretraining for certain architectural properties. The clean quantitative separation in the reported MAD values is a strength of the empirical design.

major comments (1)

[Ablation study (abstract and §4)] Ablation study paragraph: The claim that CutMix is the sole determining component is load-bearing for the central thesis, yet the manuscript provides no information on the number of independent runs performed, whether random seeds were fixed or shared across the eight ablation conditions, or the precise procedure for computing MAD (e.g., averaging over which heads/layers, handling of batch ordering). Without these controls, the binary MAD split (0.024 vs. 0.210) could reflect unmeasured interactions between CutMix and other protocol elements rather than a direct causal effect.

minor comments (2)

[Abstract and Methods] The abstract states that the Modern protocol produces 'more local and more concentrated attention' but does not define normalized entropy or cite its computation formula; this should be added to the methods section for reproducibility.
[Results (ablation table)] Table or figure reporting the per-condition MAD values should include standard deviations or confidence intervals to allow assessment of run-to-run stability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concern about the ablation study controls below and will incorporate the requested details in the revision.

read point-by-point responses

Referee: [Ablation study (abstract and §4)] Ablation study paragraph: The claim that CutMix is the sole determining component is load-bearing for the central thesis, yet the manuscript provides no information on the number of independent runs performed, whether random seeds were fixed or shared across the eight ablation conditions, or the precise procedure for computing MAD (e.g., averaging over which heads/layers, handling of batch ordering). Without these controls, the binary MAD split (0.024 vs. 0.210) could reflect unmeasured interactions between CutMix and other protocol elements rather than a direct causal effect.

Authors: We agree that the manuscript should provide these experimental details to support reproducibility and the causal interpretation. Although omitted from the current text for brevity, the ablation was run with multiple independent trials using distinct random seeds for each of the eight conditions. MAD was computed by averaging attention distances over all heads in layers 1-2 on a held-out validation set with randomized batch order. We will add a dedicated paragraph in the revised §4 (and appendix) describing the full protocol, number of runs, seed handling, and exact MAD procedure. This will confirm the robustness of the 0.024 vs. 0.210 split and rule out seed- or ordering-dependent artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical ablation on public benchmarks

full rationale

The paper reports direct experimental measurements of Mean Attention Distance (MAD) and normalized entropy under controlled training protocols on CIFAR-10, CIFAR-100, and Tiny-ImageNet. The central claim—that CutMix is the determining factor—is supported by an ablation that adds or removes each component individually and records the resulting MAD values (0.024 with CutMix vs. 0.210 without). No derivations, equations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The study is self-contained against external benchmarks; any concerns about unmeasured interactions or seed control pertain to experimental validity rather than circular reduction of the result to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical comparison rather than derivation. No free parameters or invented entities are introduced. The key background assumption is that Mean Attention Distance is a valid measure of spatial locality.

axioms (1)

domain assumption Mean Attention Distance is a faithful and sufficient proxy for spatial locality relevant to model behavior.
This metric is used to interpret all results as evidence of locality induction.

pith-pipeline@v0.9.0 · 5759 in / 1427 out tokens · 86113 ms · 2026-05-20T21:50:23.775717+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1]

E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAug- ment: Learning augmentation strategies from data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 113–123, 2019. doi: 10.1109/CVPR.2019.00020

work page doi:10.1109/cvpr.2019.00020 2019
[2]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16×16 words: Trans- formers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[3]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/ CVPR.2016.90

work page 2016
[4]

Kornblith, M

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In K. Chaudhuri and R. Salakhutdi- nov, editors,Proceedings of the 36th International Conference on Ma- chine Learning, volume 97 ofProceedings of Machine Learning Research, pages 3519–3529. PMLR, 2019

work page 2019
[5]

Krizhevsky

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

work page 2009
[6]

Le and X

Y. Le and X. Yang. Tiny ImageNet visual recognition challenge. CS231N, Stanford University, 2015. 21

work page 2015
[7]

Gradient-based learning applied to document recognition,

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learn- ing applied to document recognition.Proceedings of the IEEE, 86(11): 2278–2324, 1998. doi: 10.1109/5.726791

work page doi:10.1109/5.726791 1998
[8]

Loshchilov and F

I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. InInternational Conference on Learning Representations (ICLR), 2017

work page 2017
[9]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019

work page 2019
[10]

Müller, S

R. Müller, S. Kornblith, and G. Hinton. When does label smooth- ing help? InAdvances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019
[11]

Paszke, S

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An im- perative style, high-performance deep learning library. InAdvances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019
[12]

Raghu, T

M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovit- skiy. Do vision Transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[13]

C. E. Shannon. A mathematical theory of communication.The Bell System Technical Journal, 27:379–423, 623–656, 1948

work page 1948
[14]

Khoshgoftaar

C. Shorten and T. M. Khoshgoftaar. A survey on image data aug- mentation for deep learning.Journal of Big Data, 6(1):60, 2019. doi: 10.1186/s40537-019-0197-0

work page doi:10.1186/s40537-019-0197-0 2019
[15]

Touvron, M

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Je- gou. Training data-efficient image Transformers & distillation through attention. InProceedings of the 38th International Conference on Ma- chine Learning, volume 139 ofProceedings of Machine Learning Re- search, pages 10347–10357. PMLR, 2021

work page 2021
[16]

Kaiser, and I

A.Vaswani, N.Shazeer, N.Parmar, J.Uszkoreit, L.Jones, A.N.Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

work page 2017
[17]

S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. CutMix: Regu- larization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6023–6032, 2019. doi: 10.1109/ICCV.2019.00612. 22

work page doi:10.1109/iccv.2019.00612 2019

[1] [1]

E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAug- ment: Learning augmentation strategies from data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 113–123, 2019. doi: 10.1109/CVPR.2019.00020

work page doi:10.1109/cvpr.2019.00020 2019

[2] [2]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16×16 words: Trans- formers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021

[3] [3]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/ CVPR.2016.90

work page 2016

[4] [4]

Kornblith, M

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In K. Chaudhuri and R. Salakhutdi- nov, editors,Proceedings of the 36th International Conference on Ma- chine Learning, volume 97 ofProceedings of Machine Learning Research, pages 3519–3529. PMLR, 2019

work page 2019

[5] [5]

Krizhevsky

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

work page 2009

[6] [6]

Le and X

Y. Le and X. Yang. Tiny ImageNet visual recognition challenge. CS231N, Stanford University, 2015. 21

work page 2015

[7] [7]

Gradient-based learning applied to document recognition,

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learn- ing applied to document recognition.Proceedings of the IEEE, 86(11): 2278–2324, 1998. doi: 10.1109/5.726791

work page doi:10.1109/5.726791 1998

[8] [8]

Loshchilov and F

I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. InInternational Conference on Learning Representations (ICLR), 2017

work page 2017

[9] [9]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019

work page 2019

[10] [10]

Müller, S

R. Müller, S. Kornblith, and G. Hinton. When does label smooth- ing help? InAdvances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019

[11] [11]

Paszke, S

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An im- perative style, high-performance deep learning library. InAdvances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019

[12] [12]

Raghu, T

M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovit- skiy. Do vision Transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021

[13] [13]

C. E. Shannon. A mathematical theory of communication.The Bell System Technical Journal, 27:379–423, 623–656, 1948

work page 1948

[14] [14]

Khoshgoftaar

C. Shorten and T. M. Khoshgoftaar. A survey on image data aug- mentation for deep learning.Journal of Big Data, 6(1):60, 2019. doi: 10.1186/s40537-019-0197-0

work page doi:10.1186/s40537-019-0197-0 2019

[15] [15]

Touvron, M

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Je- gou. Training data-efficient image Transformers & distillation through attention. InProceedings of the 38th International Conference on Ma- chine Learning, volume 139 ofProceedings of Machine Learning Re- search, pages 10347–10357. PMLR, 2021

work page 2021

[16] [16]

Kaiser, and I

A.Vaswani, N.Shazeer, N.Parmar, J.Uszkoreit, L.Jones, A.N.Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

work page 2017

[17] [17]

S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. CutMix: Regu- larization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6023–6032, 2019. doi: 10.1109/ICCV.2019.00612. 22

work page doi:10.1109/iccv.2019.00612 2019