pith. machine review for the scientific record.

arxiv: 2604.23452 · v1 · submitted 2026-04-25 · 💻 cs.CV · cs.LG

Recognition: unknown

From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:18 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords vision transformers · spatial hierarchy · linear probing · depth estimation · boundary detection · interpretability · causal intervention

The pith

Classification-trained Vision Transformers encode spatial structure in a clear layer-wise hierarchy, with boundaries appearing before depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how Vision Transformers trained only for image classification acquire the ability to represent spatial properties like edges and depth, which are needed for other tasks but never explicitly supervised. It probes each layer of a frozen ViT-B/16 on boundary detection and depth estimation tasks, finding that local boundary signals become decodable around layers 5-6 while global depth signals peak at layer 8. Both signals largely disappear by the final classification layer, and the encodings are shown to be learned rather than random or architectural. Causal tests confirm that depth information is actively maintained along specific directions and partially recomputed at successive layers.

Core claim

A classification-trained ViT develops an actively maintained spatial hierarchy that mirrors the early-to-late progression observed in the primate visual cortex.

What carries the argument

Layer-wise linear probing for patch boundaries and per-patch depth, combined with single-direction ablation and targeted activation patching to test causal maintenance of the signals.
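The layer-wise probing recipe can be sketched in a few lines. Everything below is a synthetic stand-in (random activations with a depth signal planted at one layer), not the paper's ViT-B/16 pipeline; the constants `n_layers`, `d_model`, and the ridge penalty are illustrative choices. The mechanics match the method, though: fit a linear probe per frozen layer and compare decoding error across depth of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen per-layer patch activations: acts[l] has
# shape (n_patches, d_model); depth is a per-patch scalar target.
n_layers, n_patches, d_model = 12, 1000, 64
depth = rng.uniform(0.0, 1.0, size=n_patches)
direction = np.zeros(d_model)
direction[0] = 1.0  # the (hidden) axis carrying the depth signal

acts = []
for layer in range(n_layers):
    strength = 5.0 * np.exp(-0.5 * (layer - 8) ** 2)  # signal planted at layer 8
    acts.append(rng.normal(size=(n_patches, d_model))
                + strength * np.outer(depth, direction))

def fit_probe(X, y):
    """Ridge-regularized linear probe, fitted in closed form."""
    lam = 1e-2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

maes = [np.mean(np.abs(acts[l] @ fit_probe(acts[l], depth) - depth))
        for l in range(n_layers)]
best_layer = int(np.argmin(maes))  # lowest probe error marks the peak layer
print(best_layer)
```

The probe recovers the planted peak: decoding error bottoms out at the layer where the signal is strongest, which is the layer-wise "hierarchy" signature the paper reads off its MAE and AP curves.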

If this is right

  • Boundary information is encoded earlier than depth because local cues can be extracted from smaller receptive fields while depth requires integration across the image.
  • The depth signal is actively re-derived at multiple layers rather than passively carried forward in the residual stream.
  • Mid-layer interventions on the depth direction persist more strongly to later layers than early or late interventions.
  • Both spatial signals are discarded by the time the model reaches its final classification output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed ordering may explain why classification-pretrained ViTs transfer readily to dense prediction tasks even though they receive no spatial labels during training.
  • If the hierarchy is general, similar probing on other architectures could reveal whether the same early-to-late spatial progression appears outside transformers.
  • The collapse of spatial signals at the final layer suggests that any downstream spatial task must read from intermediate layers rather than the classifier head.

Load-bearing premise

Linear probes and single-direction ablations faithfully isolate the model's internal spatial representations without artifacts from the probing process or dataset choices.

What would settle it

Either of two outcomes would undermine the core claim: ablating the direction identified by the depth probe fails to degrade depth decoding by more than a few percent, or the boundary-then-depth layer ordering fails to reappear when the same probes are applied to a different classification-trained ViT.
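The ablation-specificity test can be sketched on synthetic data (a toy, not the paper's setup: the activation model, the amplitude 6.0, and the projection-removal form x' = x − (x·u)u are all assumptions here). The point of the sketch is the contrast: refitting a probe after removing the probe-identified direction versus after removing an orthogonal control direction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 32
depth = rng.uniform(0.0, 1.0, size=n)

# Toy activations in which depth lives along a single hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
acts = rng.normal(size=(n, d)) + 6.0 * np.outer(depth, true_dir)

# The linear probe's weight vector doubles as the "depth direction".
w = np.linalg.lstsq(acts, depth, rcond=None)[0]
probe_dir = w / np.linalg.norm(w)

def ablate(X, u):
    """Projection removal along a unit direction: x' = x - (x . u) u."""
    return X - np.outer(X @ u, u)

def decode_mae(X, y):
    """Refit a fresh linear probe on X and report its in-sample MAE."""
    return np.mean(np.abs(X @ np.linalg.lstsq(X, y, rcond=None)[0] - y))

base = decode_mae(acts, depth)

# Control: a random direction orthogonal to the probe's.
v = rng.normal(size=d)
v -= (v @ probe_dir) * probe_dir
v /= np.linalg.norm(v)

hit = decode_mae(ablate(acts, probe_dir), depth)   # probe direction removed
ctrl = decode_mae(ablate(acts, v), depth)          # control direction removed
print(hit / base, ctrl / base)  # large ratio vs. near-1 ratio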

Figures

Figures reproduced from arXiv: 2604.23452 by Jainum Sanghavi.

Figure 1: Boundary detection AP by layer. Pretrained ViT (blue) peaks at layer 5 (AP = 0.833).
Figure 2: Depth probing MAE by layer. Pretrained ViT (blue) reaches best MAE at layers 5–8.
Figure 3: Cross-task comparison (pretrained linear probes). Green/left axis: boundary AP (higher …).
Figure 4: MAE increase (%) when ablating the probe's weight direction (blue) vs. a random direction.
Figure 5: Dose-response at layer 8. MAE increases smoothly as ablation strength ….
Figure 6: Causal influence matrix. Rows: intervention layer; ….
Original abstract

Vision Transformers trained only on image classification routinely transfer to tasks that demand spatial understanding, yet they receive no spatial supervision during pretraining. We ask where and how robustly such structure is encoded. Probing a frozen ViT-B/16 layerwise for two complementary properties, local patch boundaries (BSDS500) and per-patch depth (NYU Depth V2), reveals a clear hierarchy: boundary structure becomes linearly decodable at layers 5-6 (AP = 0.833), while depth, which requires integrating global cues, peaks two to three layers later at layer 8 (MAE = 0.0875). Both signals collapse at the final classification layer, and random-weight controls confirm the encodings are learned rather than architectural. Causal interventions add specificity: ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other direction changes it by less than 1%. Targeted activation patching along that direction shows the depth signal is partially re-derived at each layer rather than passively carried in the residual stream, with mid-layer interventions persisting most strongly downstream. The result is that a classification-trained ViT develops an actively maintained spatial hierarchy that mirrors the early-to-late progression observed in the primate visual cortex.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Vision Transformers pretrained solely on image classification develop an internal spatial hierarchy: linear probes on frozen ViT-B/16 layers show local boundary information (BSDS500) becoming decodable at layers 5-6 (AP=0.833) while global depth (NYU Depth V2) peaks at layer 8 (MAE=0.0875), both collapsing at the final layer. Random-weight controls confirm the encodings are learned. Causal interventions demonstrate specificity, with ablation of the depth-probe direction degrading performance by up to 165% (vs. <1% for other directions) and activation patching indicating the depth signal is actively re-derived rather than passively carried in the residual stream. The authors conclude this mirrors the early-to-late progression in primate visual cortex.

Significance. If the central claims hold after addressing verification gaps, the work provides concrete empirical evidence for emergent spatial structure in classification-only ViTs, using reproducible public datasets (BSDS500, NYU Depth V2) and linear probes. This strengthens understanding of how transformers acquire hierarchical representations without spatial supervision and offers a potential bridge to neuroscience findings on visual cortex organization. The ablation and patching results, if non-circular, would be a notable contribution to mechanistic interpretability in vision models.

major comments (3)
  1. [Abstract and §4, causal interventions] The reported 165% degradation when ablating the single direction identified by the linear depth probe is partly tautological: the direction is chosen precisely to align with that probe's readout, so the result does not independently establish that the model maintains a depth representation beyond the probe construction. A control using a held-out depth dataset, a nonlinear probe, or an orthogonal direction would be required to support the 'actively maintained' claim.
  2. [Abstract and §3, probing results] The layer-wise progression (boundaries at 5-6, depth at 8) is reported with specific AP = 0.833 and MAE = 0.0875 but without error bars, multiple random seeds, or statistical tests for differences across layers; this weakens the precision of the hierarchy claim and the assertion that depth 'peaks two to three layers later.'
  3. [§4, patching experiments] The claim that targeted patching shows the depth signal is 'partially re-derived at each layer' relies on persistence of the patched direction downstream; without details on how re-derivation is quantified separately from residual-stream carry-over or probe sensitivity, it is unclear whether the intervention isolates active maintenance from other dynamics.
minor comments (2)
  1. [Abstract] The abstract states 'random-weight controls confirm the encodings are learned' but provides no quantitative comparison values or layer-wise results for the random baseline.
  2. [§3 or §4] Notation for the depth direction (e.g., how the single direction is extracted from the probe weights) should be formalized with an equation in §3 or §4 to allow exact reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below, providing clarifications and indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §4, causal interventions] The reported 165% degradation when ablating the single direction identified by the linear depth probe is partly tautological: the direction is chosen precisely to align with that probe's readout, so the result does not independently establish that the model maintains a depth representation beyond the probe construction. A control using a held-out depth dataset, a nonlinear probe, or an orthogonal direction would be required to support the 'actively maintained' claim.

    Authors: We agree that the direction is identified via the linear probe and that this introduces some dependence. However, the key evidence for functional relevance lies in the extreme specificity of the effect: ablating this direction degrades depth decoding by up to 165%, while ablating any of the other 767 directions changes performance by less than 1%. This differential impact indicates that the direction encodes information the model relies upon, beyond mere probe construction. To further address the concern, we will add an orthogonal-direction ablation control in the revised §4, confirming that only the depth-aligned direction produces the large degradation. revision: partial

  2. Referee: [Abstract and §3, probing results] The layer-wise progression (boundaries at 5-6, depth at 8) is reported with specific AP = 0.833 and MAE = 0.0875 but without error bars, multiple random seeds, or statistical tests for differences across layers; this weakens the precision of the hierarchy claim and the assertion that depth 'peaks two to three layers later.'

    Authors: The referee correctly notes the absence of variability estimates and statistical support. We will recompute all linear probes across five random seeds, report mean values with standard-error bars, and include paired statistical tests (e.g., t-tests) comparing the peak-layer performance against the preceding and following layers. These additions will be incorporated into the revised §3 and abstract to rigorously substantiate the reported two-to-three-layer offset. revision: yes

  3. Referee: [§4, patching experiments] The claim that targeted patching shows the depth signal is 'partially re-derived at each layer' relies on persistence of the patched direction downstream; without details on how re-derivation is quantified separately from residual-stream carry-over or probe sensitivity, it is unclear whether the intervention isolates active maintenance from other dynamics.

    Authors: We will expand the methods and results in §4 to provide the missing quantification details. Re-derivation is measured by comparing the downstream probe readout after targeted patching against two controls: (i) a passive-carry baseline in which the same activation is injected but subsequent layer updates are zeroed, and (ii) a sensitivity-matched random-direction patch. The difference in persistence between the targeted patch and these controls isolates active re-derivation from residual-stream propagation and probe artifacts. Expanded equations and additional figures will be added to make this separation explicit. revision: partial
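The carry-versus-recompute distinction the rebuttal describes can be illustrated with a toy residual stream (all quantities synthetic; the `rederive` knob, the noise scales, and the single known depth direction `u` are invented for the sketch, not parameters from the paper): ablate the depth direction, apply one more layer update, and check whether the signal returns.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 16
depth = rng.uniform(0.0, 1.0, size=n)
u = np.zeros(d)
u[0] = 1.0  # the depth direction, known by construction in this toy

def layer_update(h, rederive):
    """One residual-stream step that may re-inject the depth signal."""
    return h + 0.05 * rng.normal(size=h.shape) + rederive * np.outer(depth, u)

# Intervene: zero the depth direction, then apply one more layer.
h = 0.1 * rng.normal(size=(n, d)) + np.outer(depth, u)
h_ablated = h - np.outer(h @ u, u)

h_carry = layer_update(h_ablated, rederive=0.0)   # passive-carry world
h_recomp = layer_update(h_ablated, rederive=0.5)  # partial re-derivation

def depth_corr(h):
    """How decodable the depth signal is along u after the intervention."""
    return float(np.corrcoef(h @ u, depth)[0, 1])

# Under passive carry the ablation is permanent; with re-derivation the
# signal reappears downstream, which is what persistent patching detects.
print(depth_corr(h_carry), depth_corr(h_recomp))
```

This is the logic behind the proposed passive-carry baseline: if later layers recompute the signal, an upstream ablation should be partially undone, whereas a residual stream that merely carries the signal stays ablated.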

Circularity Check

1 step flagged

Probe-derived direction ablation reduces to tautological degradation by construction

specific steps
  1. fitted input called prediction [Abstract (causal interventions)]
    "ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other direction changes it by less than 1%."

    The direction is extracted by fitting the linear probe to the model's activations for the depth task. Removing precisely that fitted direction and re-evaluating the identical linear probe on depth must produce large degradation by construction, as the probe's weights define the direction being removed. This does not independently establish that the ViT actively maintains depth information in that direction beyond the probe's own fitting assumptions.

full rationale

The paper's core results consist of layer-wise linear probing on external public datasets (BSDS500 for boundaries, NYU Depth V2 for depth) to measure when spatial signals become decodable in a frozen classification ViT. These measurements do not reduce to quantities defined by the probes themselves. However, the causal intervention step identifies a single direction via the linear depth probe and then ablates it to quantify degradation in the same depth decoding task. This step is partly tautological because any direction aligned by the probe fit will dominate removal effects on that probe. The patching experiments provide somewhat more independent evidence of re-derivation. Overall, the hierarchy claim retains independent empirical content from external benchmarks, but the ablation specificity claim has this reduction, warranting a moderate circularity score.
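The flagged step can be made concrete with a toy calculation (synthetic data, illustrative only): fit a probe on pure noise, ablate its own weight direction, and re-evaluate the same probe. The readout collapses exactly, because projecting out w/||w|| zeroes X·w algebraically, so a large "degradation" number on its own does not certify that real information was maintained.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 150  # few samples per dimension, so the probe can overfit
y = rng.uniform(0.0, 1.0, size=n)
X = rng.normal(size=(n, d))  # pure noise: no genuine depth information

w = np.linalg.lstsq(X, y, rcond=None)[0]
u = w / np.linalg.norm(w)

mae_before = np.mean(np.abs(X @ w - y))
X_ablated = X - np.outer(X @ u, u)  # remove the probe's own direction
mae_after = np.mean(np.abs(X_ablated @ w - y))

# X_ablated @ w = X @ w - (X @ u)(u . w) = 0 identically: the same-probe
# readout is destroyed by construction, whatever the data contained.
print(mae_before, mae_after)
```

This is why the audit treats the orthogonal-direction and held-out controls as load-bearing: they are what distinguish the paper's 165% figure from the degradation any fitted probe would show against itself.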

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The study is empirical and relies on standard assumptions of linear probing and causal intervention in activation space; no new mathematical entities or free parameters beyond the fitted probe weights are introduced.

free parameters (1)
  • linear probe weights
    Weights of the boundary and depth probes are fitted to activations at each layer.
axioms (2)
  • domain assumption Linear decodability from activations indicates the presence of the corresponding information in the model
    Invoked when interpreting probe AP and MAE as evidence of encoded structure.
  • domain assumption Ablating the probe direction isolates the causal contribution of that direction to downstream decoding
    Basis for the 165% degradation claim and the re-derivation conclusion.

pith-pipeline@v0.9.0 · 5518 in / 1456 out tokens · 36274 ms · 2026-05-08T08:18:10.167147+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 4 canonical work pages · 1 internal anchor

  1. G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes. In ICLR Workshop, 2017.
  2. N. Belrose, D. Schneider-Joseph, S. Ravfogel, R. Cotterell, E. Raff, and S. Biderman. LEACE: Perfect linear concept erasure in closed form. In NeurIPS, 2023.
  3. M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
  4. A. Chowdhury, S. Gupta, and A. Basu. PROMPT-CAM: A simpler interpretable transformer for fine-grained analysis. arXiv preprint arXiv:2501.09133, 2025.
  5. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  6. S. Heimersheim and N. Nanda. How to use and interpret activation patching. arXiv preprint arXiv:2404.15255, 2024.
  7. J. Hewitt and P. Liang. Designing and interpreting probes with control tasks. In EMNLP, 2019.
  8. J. Hewitt and C. Manning. A structural probe for finding syntax in word representations. In NAACL, 2019.
  9. Z. Lipton. The mythos of model interpretability. Queue, 16(3), 2018.
  10. Y. Qiang, Z. Li, and W. Chen. IA-ViT: Interpretability-aware Vision Transformer. arXiv preprint arXiv:2501.10785, 2025.
  11. M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy. Do vision transformers see like convolutional neural networks? In NeurIPS, 2021.
  12. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  13. H. Zhou, R. Zhang, P. Lai, C. Guo, Y. Wang, Z. Sun, and J. Li. EL-VIT: Probing Vision Transformer with interactive visualization. arXiv preprint arXiv:2401.12666, 2024.