pith. machine review for the scientific record.

arxiv: 2605.12021 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links


What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords: what-where separation · vision transformer · object discovery · attention maps · weakly supervised segmentation · slot-based architecture · localization · inductive bias

The pith

What-Where Transformer separates object appearance from location in concurrent streams to produce emergent multi-object discovery from raw attention maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an inductive bias called what-where separation inside a Vision Transformer backbone. It treats tokens as representations of object appearance and attention maps as representations of spatial location, then routes them through separate concurrent feed-forward modules in a slot-based design. This decomposition lets both the tokens and the maps receive direct gradients from task losses, so localization information is learned explicitly rather than suppressed. A sympathetic reader would care because standard classification backbones often entangle or discard location cues, making downstream tasks like discovery and segmentation harder; the separation offers a way to obtain both kinds of information from the same forward pass without extra supervision or post-processing.

Core claim

By processing tokens as what-representations and attention maps as where-representations in concurrent feed-forward modules of a multi-stream slot-based architecture, the What-Where Transformer achieves what-where separation throughout an attentive backbone. The final-layer tokens and attention maps are reused directly for downstream tasks and exposed to task-loss gradients, enabling effective localization learning. Even when trained only with single-label classification supervision on ImageNet, the model exhibits emergent multiple object discovery directly from its raw attention maps without token clustering or other post-processing.

What carries the argument

A multi-stream, slot-based architecture that processes tokens (what-representations) and attention maps (where-representations) in concurrent feed-forward modules.
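To make the separation concrete, here is a minimal numpy sketch of one concurrent what-where step. This is a hypothetical illustration of the idea, not the paper's implementation: the names `wwt_block`, `W_what`, and `W_where` are invented for this sketch, and the real architecture stacks such blocks with additional machinery.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def wwt_block(tokens, slots, W_what, W_where):
    """One hypothetical what-where step (illustration only).

    tokens : (N, d) patch tokens, the "what" stream input
    slots  : (K, d) slot queries
    W_what : (d, d) feed-forward weights for the token/slot stream
    W_where: (N, N) feed-forward weights applied to each attention map
    """
    # Slot-to-token attention: each row is one slot's spatial map ("where").
    attn = softmax(slots @ tokens.T / np.sqrt(tokens.shape[1]), axis=-1)  # (K, N)
    # "What" stream: slots aggregate appearance features, then a feed-forward step.
    slots_out = np.maximum(attn @ tokens @ W_what, 0.0)                   # (K, d)
    # "Where" stream: the maps get their own feed-forward update, so in the
    # real model localization receives direct gradients from task losses.
    maps_out = softmax(attn @ W_where, axis=-1)                           # (K, N)
    return slots_out, maps_out
```

The point of the sketch is the routing: the attention maps are not a by-product consumed only inside attention, but a first-class output with their own transform, which is what lets both streams be reused at the final layer.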

If this is right

  • Achieves higher performance than ViT-based methods on zero-shot object discovery.
  • Outperforms prior approaches on weakly supervised semantic segmentation.
  • Transfers to multiple localization setups with only minimal architectural changes.
  • Produces multiple object discovery directly from raw attention maps without clustering or other post-processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation could simplify end-to-end pipelines for dense prediction tasks by removing the need for separate localization heads or clustering stages.
  • Because the maps are already exposed to gradients, the model might support fine-grained localization even when only coarse labels are available during training.
  • The concurrent what-where streams might be combined with existing object-centric models to improve slot binding without changing the supervision regime.

Load-bearing premise

That treating tokens and attention maps as separate what and where streams in concurrent modules will keep the two kinds of information from entangling and will allow localization to be learned from task losses alone.

What would settle it

Train the model on standard single-label ImageNet classification and check whether the raw final-layer attention maps contain spatially distinct activations for multiple separate objects in the same image; failure to observe such activations would falsify the emergence claim.
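One crude way to operationalize "spatially distinct activations" is to take each raw map's peak location and count peaks that are well separated on the patch grid. The sketch below is a hypothetical proxy metric, not the paper's evaluation protocol; `min_sep` is an invented threshold.

```python
import numpy as np

def distinct_peak_count(maps, grid, min_sep=2):
    """Count spatially distinct activation peaks across raw attention maps.

    maps : (K, H*W) final-layer attention maps, one per slot or head
    grid : (H, W) spatial size of the patch grid

    A crude proxy: take each map's argmax location and keep locations that
    are at least `min_sep` patches apart (Chebyshev distance). A count > 1
    on multi-object images would be consistent with the emergence claim.
    """
    H, W = grid
    peaks = [divmod(int(m.argmax()), W) for m in maps]  # (row, col) per map
    kept = []
    for p in peaks:
        if all(max(abs(p[0] - q[0]), abs(p[1] - q[1])) >= min_sep for q in kept):
            kept.append(p)
    return len(kept)
```

If the maps of a classification-only model consistently collapse onto a single peak regardless of how many objects the image contains, that would count against the emergence claim under this proxy.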

Figures

Figures reproduced from arXiv: 2605.12021 by Ikuro Sato, Masahiro Kada, Rei Kawakami, Ryota Yoshihashi, Satoshi Ikehata.

Figure 1
Figure 1: Concept-level illustrations of a) conventional …
Figure 3
Figure 3: Illustration of ViT's token-token connectivity and WWT's token-slot connectivity. The WWT network is configured as a stack of WWT blocks. We omit spatial hierarchy or gradual downsampling for simplicity and to maintain the original-resolution attentions. In the first WWT block of the network, the initial tokens are obtained by applying a linear transformation to the RGB values of each patch. The initial …
Figure 4
Figure 4: Task heads utilizing slot-mask representations from WWT (see Sec. 3.2).
Figure 5
Figure 5: Visualization of per-slot output masks. Red (a): Class-bound slots attend to semantically …
Figure 7
Figure 7: Translation invariance of patch tokens and slots. The results in Tab. 5 suggest that WWT performed mask-based discovery comparably with the SA-based baseline method, i.e., DINOSAUR, in the distillation-based setting. While some SA-based methods benefit from stronger autoregressive Transformer decoders, WWT enables object centricity within the backbone itself. Its ability to perform this task without addi…
Original abstract

Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the What-Where Transformer (WWT), a slot-centric Vision Transformer variant that enforces an inductive bias for what-where separation. Tokens are treated as what-representations and attention maps as where-representations; these are processed in concurrent multi-stream feed-forward modules. The final-layer tokens and attention maps are reused for downstream tasks and directly optimized by task losses. The central empirical claim is that, under standard single-label ImageNet classification supervision, WWT produces emergent multiple-object discovery directly from raw final-layer attention maps without token clustering or other post-processing, while also improving zero-shot object discovery and weakly-supervised semantic segmentation relative to ViT baselines and transferring to other localization tasks.

Significance. If the empirical claims are substantiated with proper controls, the work would be moderately significant for vision backbones: it offers a concrete architectural mechanism to reduce entanglement between semantic and spatial information without requiring explicit localization supervision or auxiliary losses. The reported transferability to multiple localization setups and the avoidance of post-processing steps would be useful if the separation is shown to be robust rather than an artifact of the slot design.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (experimental results): The claim of 'emergent multiple object discovery directly from raw attention maps' without post-processing is load-bearing for the novelty argument, yet the manuscript does not specify the exact extraction procedure (e.g., whether per-head selection, averaging, or simple thresholding is applied before visualization or metric computation). If any such step is used, it must be shown to be strictly weaker than the token-clustering baselines it is contrasted against; otherwise the separation advantage is not cleanly demonstrated.
  2. [§3.2] §3.2 (architecture) and ablation studies: The concurrent what/where feed-forward modules are presented as the source of clean decomposition, but no direct ablation compares WWT against a standard ViT with identical slot count and attention-map reuse under the same ImageNet supervision. Without this control, it remains unclear whether the observed localization gains arise from the what-where split or simply from the multi-stream slot architecture.
  3. [Table 2, Table 3] Table 2 (zero-shot discovery) and Table 3 (weakly-supervised segmentation): Performance numbers are reported without standard deviations across multiple runs or seeds, and the baselines appear to use the same ViT backbone without the concurrent modules. This makes it difficult to assess whether the reported gains are statistically reliable or attributable to the proposed separation rather than hyper-parameter differences.
minor comments (2)
  1. [§3] Notation for the slot streams and the reuse of attention maps for gradient flow should be introduced with a single diagram and consistent symbols in §3; current prose descriptions are occasionally ambiguous about which tensors receive task gradients.
  2. The manuscript states that code will be released after acceptance; adding a reproducibility checklist (data splits, exact hyper-parameters, and the precise attention-map extraction code) would strengthen the empirical claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (experimental results): The claim of 'emergent multiple object discovery directly from raw attention maps' without post-processing is load-bearing for the novelty argument, yet the manuscript does not specify the exact extraction procedure (e.g., whether per-head selection, averaging, or simple thresholding is applied before visualization or metric computation). If any such step is used, it must be shown to be strictly weaker than the token-clustering baselines it is contrasted against; otherwise the separation advantage is not cleanly demonstrated.

    Authors: We will revise the manuscript to explicitly detail the extraction procedure. The final-layer attention maps are used in their raw form for both visualization and quantitative metrics (e.g., object discovery evaluation), with only standard multi-head averaging applied as is conventional in ViT attention analysis—no per-head selection, thresholding, clustering, or other post-processing steps. This procedure is indeed minimal and weaker than the token-clustering baselines we compare against, directly supporting the emergent separation claim. Updated description and examples will be added to §4 and the appendix. revision: yes

  2. Referee: [§3.2] §3.2 (architecture) and ablation studies: The concurrent what/where feed-forward modules are presented as the source of clean decomposition, but no direct ablation compares WWT against a standard ViT with identical slot count and attention-map reuse under the same ImageNet supervision. Without this control, it remains unclear whether the observed localization gains arise from the what-where split or simply from the multi-stream slot architecture.

    Authors: This is a fair point on isolating the contribution of the concurrent modules. Standard ViT lacks native slot-centric processing and direct attention-map reuse, so a perfect 1:1 control is not straightforward. However, we will add a new ablation in the revised §3.2 and experiments comparing WWT to a merged single-stream slot variant (same slot count, attention reuse, and supervision) to isolate the effect of the what/where split. This will clarify that the gains stem from the concurrent design rather than slots alone. revision: partial

  3. Referee: [Table 2, Table 3] Table 2 (zero-shot discovery) and Table 3 (weakly-supervised segmentation): Performance numbers are reported without standard deviations across multiple runs or seeds, and the baselines appear to use the same ViT backbone without the concurrent modules. This makes it difficult to assess whether the reported gains are statistically reliable or attributable to the proposed separation rather than hyper-parameter differences.

    Authors: We agree that standard deviations would enhance statistical reliability. Due to compute limits in the original runs, we reported single-run results, but we will re-execute the key experiments across 3 seeds and update Tables 2 and 3 with means ± std. Baselines were reimplemented under matched hyperparameters and training protocols where feasible; we will add explicit notes on any minor differences in the text and appendix to rule out confounds. revision: yes
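The "raw" extraction the rebuttal commits to (response 1) amounts to head-averaging the last layer's [CLS]-to-patch attention and nothing more. A minimal numpy sketch of that convention, assuming a ViT-style layout where token 0 is [CLS]; the function name is invented here and the paper may index tokens differently:

```python
import numpy as np

def averaged_cls_attention(attn, grid):
    """Conventional raw-attention extraction for a ViT-style model.

    attn : (heads, tokens, tokens) last-layer attention, token 0 = [CLS]
    grid : (H, W) patch grid

    Returns the head-averaged [CLS]-to-patch map reshaped to the grid --
    the kind of "raw" map the rebuttal describes, with no per-head
    selection, thresholding, or clustering applied.
    """
    cls_to_patches = attn[:, 0, 1:]                 # (heads, H*W)
    return cls_to_patches.mean(axis=0).reshape(grid)
```

Because this step is strictly weaker than token clustering, showing the quantitative results survive it (and only it) is what would make the no-post-processing comparison clean.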

Circularity Check

0 steps flagged

No significant circularity; claims rest on explicit architectural inductive bias rather than self-referential fits or citations.

full rationale

The paper defines WWT via two explicit design choices—treating tokens as what-representations and attention maps as where-representations in concurrent slot-based feed-forward modules, plus direct exposure of both to task-loss gradients—without any equation that reduces the claimed what-where separation or emergent discovery to a quantity fitted from the same data or imported via self-citation. The abstract presents the multiple-object discovery result as an empirical outcome under standard ImageNet supervision, not as a prediction derived from the architecture's own fitted parameters. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the derivation chain; the central separation is an imposed inductive bias whose effectiveness is evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the architecture is described at the level of inductive bias and module organization without numerical fitting details or unstated background assumptions.

pith-pipeline@v0.9.0 · 5586 in / 1139 out tokens · 52163 ms · 2026-05-13T06:55:01.964879+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 1 internal anchor

  1. [1]

    Quantifying attention flow in transformers

    Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. InAnnual Meeting of the Association for Computational Linguistics, pages 4190–4197, 2020

  2. [2]

    MONet: Unsupervised Scene Decomposition and Representation

    Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. NONet: Unsupervised scene decomposition and representation.arXiv preprint arXiv:1901.11390, 2019

  3. [3]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean Conference on Computer Vision, pages 213–229, 2020

  4. [4]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021

  5. [5]

    Weakly-supervised semantic segmentation via sub-category exploration

    Yu-Ting Chang, Qiaosong Wang, Wei-Chih Hung, Robinson Piramuthu, Yi-Hsuan Tsai, and Ming-Hsuan Yang. Weakly-supervised semantic segmentation via sub-category exploration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8991–9000, 2020

  6. [6]

    Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks

    Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. InWinter Conference on Applications of Computer Vision, pages 839–847. IEEE, 2018

  7. [7]

    Mobile-former: Bridging mobilenet and transformer

    Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-former: Bridging mobilenet and transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5270–5279, 2022

  8. [8]

    Dual path networks

    Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. Advances in Neural Information Processing Systems, 30, 2017

  9. [9]

    Siamese DETR

    Zeren Chen, Gengshi Huang, Wei Li, Jianing Teng, Kun Wang, Jing Shao, Chen Change Loy, and Lu Sheng. Siamese DETR. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15722–15731, 2023

  10. [10]

    Masked- attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked- attention mask transformer for universal image segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022

  11. [11]

    Per-pixel classification is not all you need for semantic segmentation.Advances in Neural Information Processing Systems, 34:17864–17875, 2021

    Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation.Advances in Neural Information Processing Systems, 34:17864–17875, 2021

  12. [12]

    A dual-stream neural network explains the functional segregation of dorsal and ventral visual pathways in human brains

    Minkyu Choi, Kuan Han, Xiaokai Wang, Yizhen Zhang, and Zhongming Liu. A dual-stream neural network explains the functional segregation of dorsal and ventral visual pathways in human brains. In Advances in Neural Information Processing Systems, 2023

  13. [13]

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling.arXiv preprint, arXiv:1412.3555, 2014

  14. [14]

    Multi-column deep neural networks for image classification

    Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3642–3649. IEEE, 2012

  15. [15]

    Deep feature factorization for concept discovery

    Edo Collins, Radhakrishna Achanta, and Sabine Susstrunk. Deep feature factorization for concept discovery. InEuropean Conference on Computer Vision, pages 336–352, 2018

  16. [16]

    UP-DETR: Unsupervised pre-training for object detection with transformers

    Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. UP-DETR: Unsupervised pre-training for object detection with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1601–1610, 2021. 10

  17. [17]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InInternational Conference on Learning Representations, 2024

  18. [18]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, An- dreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational Conference on Machine Learning, pages 7480– 7512, 2023

  19. [19]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255. Ieee, 2009

  20. [20]

    Perceptual group tokenizer: Building perception with iterative grouping.International Conference on Learning Representations, 2024

    Zhiwei Deng, Ting Chen, and Yang Li. Perceptual group tokenizer: Building perception with iterative grouping.International Conference on Learning Representations, 2024

  21. [21]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations

  22. [22]

    Ventral-dorsal neural networks: object detection via selective attention

    Mohammad K Ebrahimpour, Jiayun Li, Yen-Yun Yu, Jackson Reesee, Azadeh Moghtaderi, Ming-Hsuan Yang, and David C Noelle. Ventral-dorsal neural networks: object detection via selective attention. In Winter Conference on Applications of Computer Vision, pages 986–994. IEEE, 2019

  23. [23]

    CRAFT: Concept recursive activation factorization for explainability

    Thomas Fel, Agustin Picard, Louis Bethune, Thibaut Boissin, David Vigouroux, Julien Colin, Rémi Cadène, and Thomas Serre. CRAFT: Concept recursive activation factorization for explainability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2711–2721, 2023

  24. [24]

    Large-scale unsupervised semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7457–7476, 2022

    Shanghua Gao, Zhong-Yu Li, Ming-Hsuan Yang, Ming-Ming Cheng, Junwei Han, and Philip Torr. Large-scale unsupervised semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7457–7476, 2022

  25. [25]

    Multi-fold MIL training for weakly supervised object localization

    Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Multi-fold MIL training for weakly supervised object localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2409–2416, 2014

  26. [26]

    what” and “where

    Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, and Michael C Mozer. Decoupling the “what” and “where” with polar coordinate positional embeddings.arXiv preprint arXiv:2509.10534, 2025

  27. [27]

    Inductive biases for deep learning of higher-level cognition.Proceed- ings of the Royal Society A, 478(2266):20210068, 2022

    Anirudh Goyal and Yoshua Bengio. Inductive biases for deep learning of higher-level cognition.Proceed- ings of the Royal Society A, 478(2266):20210068, 2022

  28. [28]

    Emergence of complex-like cells in a temporal product network with local receptive fields.arXiv preprint arXiv:1006.0448, 2010

    Karo Gregor and Yann LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields.arXiv preprint arXiv:1006.0448, 2010

  29. [29]

    ViTOL: Vision transformer for weakly supervised object localization

    Saurav Gupta, Sourav Lakhotia, Abhay Rawat, and Rahul Tallamraju. ViTOL: Vision transformer for weakly supervised object localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4101–4110, 2022

  30. [30]

    Egocentric human activities recognition with multimodal interaction sensing.IEEE Sensors Journal, 24(5):7085–7096, 2024

    Yuzhe Hao, Asako Kanezaki, Ikuro Sato, Rei Kawakami, and Koichi Shinoda. Egocentric human activities recognition with multimodal interaction sensing.IEEE Sensors Journal, 24(5):7085–7096, 2024

  31. [31]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

  32. [32]

    Columbia university, 1997

    Orris C Herfindahl.Concentration in the steel industry. Columbia university, 1997

  33. [33]

    Transforming auto-encoders

    Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. InInternational conference on artificial neural networks, pages 44–51. Springer, 2011

  34. [34]

    Object-centric slot diffusion.Advances in Neural Information Processing Systems, arXiv:2303.10834, 2023

    Jindong Jiang, Fei Deng, Gautam Singh, and Sungjin Ahn. Object-centric slot diffusion.Advances in Neural Information Processing Systems, arXiv:2303.10834, 2023

  35. [35]

    SPOT: Self- training with patch-order permutation for object-centric learning with autoregressive transformers

    Ioannis Kakogeorgiou, Spyros Gidaris, Konstantinos Karantzalos, and Nikos Komodakis. SPOT: Self- training with patch-order permutation for object-centric learning with autoregressive transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22776– 22786, 2024. 11

  36. [36]

    Cross-connected networks for multi-task learning of detection and segmentation

    Rei Kawakami, Ryota Yoshihashi, Seiichiro Fukuda, Shaodi You, Makoto Iida, and Takeshi Naemura. Cross-connected networks for multi-task learning of detection and segmentation. pages 3636–3640. IEEE, 2019

  37. [37]

    On permutation- invariant neural networks.arXiv preprint arXiv:2403.17410, 2024

    Masanari Kimura, Ryotaro Shimizu, Yuki Hirakawa, Ryosuke Goto, and Yuki Saito. On permutation- invariant neural networks.arXiv preprint arXiv:2403.17410, 2024

  38. [38]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

  39. [39]

    Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation

    Jungbeom Lee, Eunji Kim, and Sungroh Yoon. Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

  40. [40]

    Scouter: Slot attention-based classifier for explainable image recognition

    Liangzhi Li, Bowen Wang, Manisha Verma, Yuta Nakashima, Ryo Kawasaki, and Hajime Nagahara. Scouter: Slot attention-based classifier for explainable image recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1046–1055, 2021

  41. [41]

    Exploring plain vision transformer backbones for object detection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. InEuropean Conference on Computer Vision, pages 280–296, 2022

  42. [42]

    Token activation map to visually explain multimodal llms

    Yi Li, Hualiang Wang, Xinpeng Ding, Haonan Wang, and Xiaomeng Li. Token activation map to visually explain multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 48–58, 2025

  43. [43]

    Pay attention to mlps.Advances in Neural Information Processing Systems, 34:9204–9215, 2021

    Hanxiao Liu, Zihang Dai, David So, and Quoc V Le. Pay attention to mlps.Advances in Neural Information Processing Systems, 34:9204–9215, 2021

  44. [44]

    Self-supervised learning of intertwined content and positional features for object detection.International Conference on Machine Learning, 267:39552–39567, 2025

    Kang Jun Liu, Masanori Suganuma, and Takayuki Okatani. Self-supervised learning of intertwined content and positional features for object detection.International Conference on Machine Learning, 267:39552–39567, 2025

  45. [45]

    An intriguing failing of convolutional neural networks and the CoordConv solution.Advances in Neural Information Processing Systems, 31, 2018

    Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution.Advances in Neural Information Processing Systems, 31, 2018

  46. [46]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021

  47. [47]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022

  48. [48]

    Object-centric learning with slot attention.Advances in Neural Information Processing Systems, 33:11525–11538, 2020

    Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention.Advances in Neural Information Processing Systems, 33:11525–11538, 2020

[49] Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, and Yun Fu. Image as set of points. International Conference on Learning Representations, 2023

[50] A. David Milner and Melvyn A. Goodale. The Visual Brain in Action. Oxford University Press, Oxford, 1995

[51] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In European Conference on Computer Vision, pages 728–755. Springer, 2022

[52] Mortimer Mishkin, Leslie G Ungerleider, and Kathleen A Macko. Object vision and spatial vision: two cortical pathways. Trends in Neurosciences, 6:414–417, 1983

[53] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016

[54] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024

[55] Bill Psomas, Ioannis Kakogeorgiou, Konstantinos Karantzalos, and Yannis Avrithis. Keep it SimPool: Who said supervised transformers suffer from attention deficit? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5350–5360, 2023

[56] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[57] Sai Saketh Rambhatla, Ishan Misra, Rama Chellappa, and Abhinav Shrivastava. MOST: Multiple object localization with self-supervised transformers for object discovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15823–15834, 2023

[58] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021

[59] Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, and Narendra Ahuja. Finding distributed object-centric properties in self-supervised transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

[60] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, 2017

[61] Maximilian Seitzer et al. Bridging the gap to real-world object-centric learning. International Conference on Learning Representations, 2023

[62] Cheng Shi, Yizhou Yu, and Sibei Yang. Vision transformers need more than registers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

[63] Oriane Siméoni, Gilles Puy, Huy V. Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. 2021

[64] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 27, 2014

[65] Gautam Singh, Fei Deng, and Sungjin Ahn. Illiterate DALL-E learns to compose. In International Conference on Learning Representations, 2022

[66] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015

[67] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016

[68] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. ResMLP: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):5314–5321, 2022

[69] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357, 2021

[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

[71] Bowen Wang, Liangzhi Li, Yuta Nakashima, and Hajime Nagahara. Learning bottleneck concepts in image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10962–10971, 2023

[72] Bowen Wang, Liangzhi Li, Jiahao Zhang, Yuta Nakashima, and Hajime Nagahara. Explainable image recognition via enhanced slot-attention based classifier. arXiv preprint arXiv:2407.05616, 2024

[73] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021

[74] Yangtao Wang, Xi Shen, Shell Xu Hu, Yuan Yuan, James L Crowley, and Dominique Vaufreydaz. Self-supervised transformers for unsupervised object discovery using normalized cut. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14543–14553, 2022

[75] Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12275–12284, 2020

[76] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22–31, 2021

[77] Yi-Fu Wu, Klaus Greff, Gamaleldin Fathy Elsayed, Michael Curtis Mozer, Thomas Kipf, and Sjoerd van Steenkiste. Inverted-attention transformers can learn object representations: Insights from slot attention. In Causal Representation Learning Workshop at NeurIPS, 2023

[78] Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. Audiovisual SlowFast networks for video recognition. arXiv preprint arXiv:2001.08740, 2020

[79] Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid Boussaid, and Dan Xu. Multi-class token transformer for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4310–4319, 2022

[80] Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, and Tao Mei. Dual vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10870–10882, 2023
