pith. machine review for the scientific record.

arxiv: 2605.12021 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links


What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords: what-where separation · vision transformer · object discovery · attention maps · weakly supervised segmentation · slot-based architecture · localization · inductive bias

The pith

What-Where Transformer separates object appearance from location in concurrent streams to produce emergent multi-object discovery from raw attention maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an inductive bias called what-where separation inside a Vision Transformer backbone. It treats tokens as representations of object appearance and attention maps as representations of spatial location, then routes them through separate concurrent feed-forward modules in a slot-based design. This decomposition lets both the tokens and the maps receive direct gradients from task losses, so localization information is learned explicitly rather than suppressed. A sympathetic reader would care because standard classification backbones often entangle or discard location cues, making downstream tasks like discovery and segmentation harder; the separation offers a way to obtain both kinds of information from the same forward pass without extra supervision or post-processing.

Core claim

By processing tokens as what-representations and attention maps as where-representations in concurrent feed-forward modules of a multi-stream slot-based architecture, the What-Where Transformer achieves what-where separation throughout an attentive backbone. The final-layer tokens and attention maps are reused directly for downstream tasks and exposed to task-loss gradients, enabling effective localization learning. Even when trained only with single-label classification supervision on ImageNet, the model exhibits emergent multiple object discovery directly from its raw attention maps without token clustering or other post-processing.

What carries the argument

A multi-stream, slot-based architecture that processes tokens (what-representations) and attention maps (where-representations) in concurrent feed-forward modules.
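To make the separation concrete, here is a minimal numpy sketch of one concurrent what-where step. This is a hypothetical illustration of the idea, not the paper's implementation: the names `wwt_block`, `W_what`, and `W_where` are invented for this sketch, and the real architecture stacks such blocks with additional machinery.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def wwt_block(tokens, slots, W_what, W_where):
    """One hypothetical what-where step (illustration only).

    tokens : (N, d) patch tokens, the "what" stream input
    slots  : (K, d) slot queries
    W_what : (d, d) feed-forward weights for the token/slot stream
    W_where: (N, N) feed-forward weights applied to each attention map
    """
    # Slot-to-token attention: each row is one slot's spatial map ("where").
    attn = softmax(slots @ tokens.T / np.sqrt(tokens.shape[1]), axis=-1)  # (K, N)
    # "What" stream: slots aggregate appearance features, then a feed-forward step.
    slots_out = np.maximum(attn @ tokens @ W_what, 0.0)                   # (K, d)
    # "Where" stream: the maps get their own feed-forward update, so in the
    # real model localization receives direct gradients from task losses.
    maps_out = softmax(attn @ W_where, axis=-1)                           # (K, N)
    return slots_out, maps_out
```

The point of the sketch is the routing: the attention maps are not a by-product consumed only inside attention, but a first-class output with their own transform, which is what lets both streams be reused at the final layer.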

If this is right

  • Achieves higher performance than ViT-based methods on zero-shot object discovery.
  • Outperforms prior approaches on weakly supervised semantic segmentation.
  • Transfers to multiple localization setups with only minimal architectural changes.
  • Produces multiple object discovery directly from raw attention maps without clustering or other post-processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation could simplify end-to-end pipelines for dense prediction tasks by removing the need for separate localization heads or clustering stages.
  • Because the maps are already exposed to gradients, the model might support fine-grained localization even when only coarse labels are available during training.
  • The concurrent what-where streams might be combined with existing object-centric models to improve slot binding without changing the supervision regime.

Load-bearing premise

That treating tokens and attention maps as separate what and where streams in concurrent modules will keep the two kinds of information from entangling and will allow localization to be learned from task losses alone.

What would settle it

Train the model on standard single-label ImageNet classification and check whether the raw final-layer attention maps contain spatially distinct activations for multiple separate objects in the same image; failure to observe such activations would falsify the emergence claim.
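One crude way to operationalize "spatially distinct activations" is to take each raw map's peak location and count peaks that are well separated on the patch grid. The sketch below is a hypothetical proxy metric, not the paper's evaluation protocol; `min_sep` is an invented threshold.

```python
import numpy as np

def distinct_peak_count(maps, grid, min_sep=2):
    """Count spatially distinct activation peaks across raw attention maps.

    maps : (K, H*W) final-layer attention maps, one per slot or head
    grid : (H, W) spatial size of the patch grid

    A crude proxy: take each map's argmax location and keep locations that
    are at least `min_sep` patches apart (Chebyshev distance). A count > 1
    on multi-object images would be consistent with the emergence claim.
    """
    H, W = grid
    peaks = [divmod(int(m.argmax()), W) for m in maps]  # (row, col) per map
    kept = []
    for p in peaks:
        if all(max(abs(p[0] - q[0]), abs(p[1] - q[1])) >= min_sep for q in kept):
            kept.append(p)
    return len(kept)
```

If the maps of a classification-only model consistently collapse onto a single peak regardless of how many objects the image contains, that would count against the emergence claim under this proxy.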

Figures

Figures reproduced from arXiv: 2605.12021 by Ikuro Sato, Masahiro Kada, Rei Kawakami, Ryota Yoshihashi, Satoshi Ikehata.

Figure 1
Figure 1: Concept-level illustrations of a) conventional …
Figure 3
Figure 3: Illustration of ViT's token-token connectivity and WWT's token-slot connectivity. The WWT network is configured as a stack of WWT blocks. We omit spatial hierarchy or gradual downsampling for simplicity and to maintain the original-resolution attentions. In the first WWT block of the network, the initial tokens are obtained by applying a linear transformation to the RGB values of each patch. The initial …
Figure 4
Figure 4: Task heads utilizing slot-mask representations from WWT (see Sec. 3.2).
Figure 5
Figure 5: Visualization of per-slot output masks. Red (a): Class-bound slots attend to semantically …
Figure 7
Figure 7: Translation invariance of patch tokens and slots. The results in Tab. 5 suggest that WWT performed mask-based discovery comparably with the SA-based baseline method, i.e., DINOSAUR, in the distillation-based setting. While some SA-based methods benefit from stronger autoregressive Transformer decoders, WWT enables object centricity within the backbone itself. Its ability to perform this task without addi…
Original abstract

Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the What-Where Transformer (WWT), a slot-centric Vision Transformer variant that enforces an inductive bias for what-where separation. Tokens are treated as what-representations and attention maps as where-representations; these are processed in concurrent multi-stream feed-forward modules. The final-layer tokens and attention maps are reused for downstream tasks and directly optimized by task losses. The central empirical claim is that, under standard single-label ImageNet classification supervision, WWT produces emergent multiple-object discovery directly from raw final-layer attention maps without token clustering or other post-processing, while also improving zero-shot object discovery and weakly-supervised semantic segmentation relative to ViT baselines and transferring to other localization tasks.

Significance. If the empirical claims are substantiated with proper controls, the work would be moderately significant for vision backbones: it offers a concrete architectural mechanism to reduce entanglement between semantic and spatial information without requiring explicit localization supervision or auxiliary losses. The reported transferability to multiple localization setups and the avoidance of post-processing steps would be useful if the separation is shown to be robust rather than an artifact of the slot design.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (experimental results): The claim of 'emergent multiple object discovery directly from raw attention maps' without post-processing is load-bearing for the novelty argument, yet the manuscript does not specify the exact extraction procedure (e.g., whether per-head selection, averaging, or simple thresholding is applied before visualization or metric computation). If any such step is used, it must be shown to be strictly weaker than the token-clustering baselines it is contrasted against; otherwise the separation advantage is not cleanly demonstrated.
  2. [§3.2] §3.2 (architecture) and ablation studies: The concurrent what/where feed-forward modules are presented as the source of clean decomposition, but no direct ablation compares WWT against a standard ViT with identical slot count and attention-map reuse under the same ImageNet supervision. Without this control, it remains unclear whether the observed localization gains arise from the what-where split or simply from the multi-stream slot architecture.
  3. [Table 2, Table 3] Table 2 (zero-shot discovery) and Table 3 (weakly-supervised segmentation): Performance numbers are reported without standard deviations across multiple runs or seeds, and the baselines appear to use the same ViT backbone without the concurrent modules. This makes it difficult to assess whether the reported gains are statistically reliable or attributable to the proposed separation rather than hyper-parameter differences.
minor comments (2)
  1. [§3] Notation for the slot streams and the reuse of attention maps for gradient flow should be introduced with a single diagram and consistent symbols in §3; current prose descriptions are occasionally ambiguous about which tensors receive task gradients.
  2. The manuscript states that code will be released after acceptance; adding a reproducibility checklist (data splits, exact hyper-parameters, and the precise attention-map extraction code) would strengthen the empirical claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (experimental results): The claim of 'emergent multiple object discovery directly from raw attention maps' without post-processing is load-bearing for the novelty argument, yet the manuscript does not specify the exact extraction procedure (e.g., whether per-head selection, averaging, or simple thresholding is applied before visualization or metric computation). If any such step is used, it must be shown to be strictly weaker than the token-clustering baselines it is contrasted against; otherwise the separation advantage is not cleanly demonstrated.

    Authors: We will revise the manuscript to explicitly detail the extraction procedure. The final-layer attention maps are used in their raw form for both visualization and quantitative metrics (e.g., object discovery evaluation), with only standard multi-head averaging applied as is conventional in ViT attention analysis—no per-head selection, thresholding, clustering, or other post-processing steps. This procedure is indeed minimal and weaker than the token-clustering baselines we compare against, directly supporting the emergent separation claim. Updated description and examples will be added to §4 and the appendix. revision: yes

  2. Referee: [§3.2] §3.2 (architecture) and ablation studies: The concurrent what/where feed-forward modules are presented as the source of clean decomposition, but no direct ablation compares WWT against a standard ViT with identical slot count and attention-map reuse under the same ImageNet supervision. Without this control, it remains unclear whether the observed localization gains arise from the what-where split or simply from the multi-stream slot architecture.

    Authors: This is a fair point on isolating the contribution of the concurrent modules. Standard ViT lacks native slot-centric processing and direct attention-map reuse, so a perfect 1:1 control is not straightforward. However, we will add a new ablation in the revised §3.2 and experiments comparing WWT to a merged single-stream slot variant (same slot count, attention reuse, and supervision) to isolate the effect of the what/where split. This will clarify that the gains stem from the concurrent design rather than slots alone. revision: partial

  3. Referee: [Table 2, Table 3] Table 2 (zero-shot discovery) and Table 3 (weakly-supervised segmentation): Performance numbers are reported without standard deviations across multiple runs or seeds, and the baselines appear to use the same ViT backbone without the concurrent modules. This makes it difficult to assess whether the reported gains are statistically reliable or attributable to the proposed separation rather than hyper-parameter differences.

    Authors: We agree that standard deviations would enhance statistical reliability. Due to compute limits in the original runs, we reported single-run results, but we will re-execute the key experiments across 3 seeds and update Tables 2 and 3 with means ± std. Baselines were reimplemented under matched hyperparameters and training protocols where feasible; we will add explicit notes on any minor differences in the text and appendix to rule out confounds. revision: yes
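The "raw" extraction the rebuttal commits to (response 1) amounts to head-averaging the last layer's [CLS]-to-patch attention and nothing more. A minimal numpy sketch of that convention, assuming a ViT-style layout where token 0 is [CLS]; the function name is invented here and the paper may index tokens differently:

```python
import numpy as np

def averaged_cls_attention(attn, grid):
    """Conventional raw-attention extraction for a ViT-style model.

    attn : (heads, tokens, tokens) last-layer attention, token 0 = [CLS]
    grid : (H, W) patch grid

    Returns the head-averaged [CLS]-to-patch map reshaped to the grid --
    the kind of "raw" map the rebuttal describes, with no per-head
    selection, thresholding, or clustering applied.
    """
    cls_to_patches = attn[:, 0, 1:]                 # (heads, H*W)
    return cls_to_patches.mean(axis=0).reshape(grid)
```

Because this step is strictly weaker than token clustering, showing the quantitative results survive it (and only it) is what would make the no-post-processing comparison clean.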

Circularity Check

0 steps flagged

No significant circularity; claims rest on explicit architectural inductive bias rather than self-referential fits or citations.

full rationale

The paper defines WWT via two explicit design choices—treating tokens as what-representations and attention maps as where-representations in concurrent slot-based feed-forward modules, plus direct exposure of both to task-loss gradients—without any equation that reduces the claimed what-where separation or emergent discovery to a quantity fitted from the same data or imported via self-citation. The abstract presents the multiple-object discovery result as an empirical outcome under standard ImageNet supervision, not as a prediction derived from the architecture's own fitted parameters. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the derivation chain; the central separation is an imposed inductive bias whose effectiveness is evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the architecture is described at the level of inductive bias and module organization without numerical fitting details or unstated background assumptions.

pith-pipeline@v0.9.0 · 5586 in / 1139 out tokens · 52163 ms · 2026-05-13T06:55:01.964879+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 1 internal anchor

  1. [1]

    Quantifying attention flow in transformers

    Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. InAnnual Meeting of the Association for Computational Linguistics, pages 4190–4197, 2020

  2. [2]

    MONet: Unsupervised Scene Decomposition and Representation

    Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. NONet: Unsupervised scene decomposition and representation.arXiv preprint arXiv:1901.11390, 2019

  3. [3]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean Conference on Computer Vision, pages 213–229, 2020

  4. [4]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021

  5. [5]

    Weakly-supervised semantic segmentation via sub-category exploration

    Yu-Ting Chang, Qiaosong Wang, Wei-Chih Hung, Robinson Piramuthu, Yi-Hsuan Tsai, and Ming-Hsuan Yang. Weakly-supervised semantic segmentation via sub-category exploration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8991–9000, 2020

  6. [6]

    Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks

    Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. InWinter Conference on Applications of Computer Vision, pages 839–847. IEEE, 2018

  7. [7]

    Mobile-former: Bridging mobilenet and transformer

    Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-former: Bridging mobilenet and transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5270–5279, 2022

  8. [8]

    Dual path networks

    Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. Advances in Neural Information Processing Systems, 30, 2017

  9. [9]

    Siamese DETR

    Zeren Chen, Gengshi Huang, Wei Li, Jianing Teng, Kun Wang, Jing Shao, Chen Change Loy, and Lu Sheng. Siamese DETR. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15722–15731, 2023

  10. [10]

    Masked- attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked- attention mask transformer for universal image segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022

  11. [11]

    Per-pixel classification is not all you need for semantic segmentation.Advances in Neural Information Processing Systems, 34:17864–17875, 2021

    Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation.Advances in Neural Information Processing Systems, 34:17864–17875, 2021

  12. [12]

    A dual-stream neural network explains the functional segregation of dorsal and ventral visual pathways in human brains

    Minkyu Choi, Kuan Han, Xiaokai Wang, Yizhen Zhang, and Zhongming Liu. A dual-stream neural network explains the functional segregation of dorsal and ventral visual pathways in human brains. In Advances in Neural Information Processing Systems, 2023

  13. [13]

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling.arXiv preprint, arXiv:1412.3555, 2014

  14. [14]

    Multi-column deep neural networks for image classification

    Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3642–3649. IEEE, 2012

  15. [15]

    Deep feature factorization for concept discovery

    Edo Collins, Radhakrishna Achanta, and Sabine Susstrunk. Deep feature factorization for concept discovery. InEuropean Conference on Computer Vision, pages 336–352, 2018

  16. [16]

    UP-DETR: Unsupervised pre-training for object detection with transformers

    Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. UP-DETR: Unsupervised pre-training for object detection with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1601–1610, 2021. 10

  17. [17]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InInternational Conference on Learning Representations, 2024

  18. [18]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, An- dreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational Conference on Machine Learning, pages 7480– 7512, 2023

  19. [19]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255. Ieee, 2009

  20. [20]

    Perceptual group tokenizer: Building perception with iterative grouping.International Conference on Learning Representations, 2024

    Zhiwei Deng, Ting Chen, and Yang Li. Perceptual group tokenizer: Building perception with iterative grouping.International Conference on Learning Representations, 2024

  21. [21]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations

  22. [22]

    Ventral-dorsal neural networks: object detection via selective attention

    Mohammad K Ebrahimpour, Jiayun Li, Yen-Yun Yu, Jackson Reesee, Azadeh Moghtaderi, Ming-Hsuan Yang, and David C Noelle. Ventral-dorsal neural networks: object detection via selective attention. In Winter Conference on Applications of Computer Vision, pages 986–994. IEEE, 2019

  23. [23]

    CRAFT: Concept recursive activation factorization for explainability

    Thomas Fel, Agustin Picard, Louis Bethune, Thibaut Boissin, David Vigouroux, Julien Colin, Rémi Cadène, and Thomas Serre. CRAFT: Concept recursive activation factorization for explainability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2711–2721, 2023

  24. [24]

    Large-scale unsupervised semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7457–7476, 2022

    Shanghua Gao, Zhong-Yu Li, Ming-Hsuan Yang, Ming-Ming Cheng, Junwei Han, and Philip Torr. Large-scale unsupervised semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7457–7476, 2022

  25. [25]

    Multi-fold MIL training for weakly supervised object localization

    Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Multi-fold MIL training for weakly supervised object localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2409–2416, 2014

  26. [26]

    what” and “where

    Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, and Michael C Mozer. Decoupling the “what” and “where” with polar coordinate positional embeddings.arXiv preprint arXiv:2509.10534, 2025

  27. [27]

    Inductive biases for deep learning of higher-level cognition.Proceed- ings of the Royal Society A, 478(2266):20210068, 2022

    Anirudh Goyal and Yoshua Bengio. Inductive biases for deep learning of higher-level cognition.Proceed- ings of the Royal Society A, 478(2266):20210068, 2022

  28. [28]

    Emergence of complex-like cells in a temporal product network with local receptive fields.arXiv preprint arXiv:1006.0448, 2010

    Karo Gregor and Yann LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields.arXiv preprint arXiv:1006.0448, 2010

  29. [29]

    ViTOL: Vision transformer for weakly supervised object localization

    Saurav Gupta, Sourav Lakhotia, Abhay Rawat, and Rahul Tallamraju. ViTOL: Vision transformer for weakly supervised object localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4101–4110, 2022

  30. [30]

    Egocentric human activities recognition with multimodal interaction sensing.IEEE Sensors Journal, 24(5):7085–7096, 2024

    Yuzhe Hao, Asako Kanezaki, Ikuro Sato, Rei Kawakami, and Koichi Shinoda. Egocentric human activities recognition with multimodal interaction sensing.IEEE Sensors Journal, 24(5):7085–7096, 2024

  31. [31]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

  32. [32]

    Columbia university, 1997

    Orris C Herfindahl.Concentration in the steel industry. Columbia university, 1997

  33. [33]

    Transforming auto-encoders

    Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. InInternational conference on artificial neural networks, pages 44–51. Springer, 2011

  34. [34]

    Object-centric slot diffusion.Advances in Neural Information Processing Systems, arXiv:2303.10834, 2023

    Jindong Jiang, Fei Deng, Gautam Singh, and Sungjin Ahn. Object-centric slot diffusion.Advances in Neural Information Processing Systems, arXiv:2303.10834, 2023

  35. [35]

    SPOT: Self- training with patch-order permutation for object-centric learning with autoregressive transformers

    Ioannis Kakogeorgiou, Spyros Gidaris, Konstantinos Karantzalos, and Nikos Komodakis. SPOT: Self- training with patch-order permutation for object-centric learning with autoregressive transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22776– 22786, 2024. 11

  36. [36]

    Cross-connected networks for multi-task learning of detection and segmentation

    Rei Kawakami, Ryota Yoshihashi, Seiichiro Fukuda, Shaodi You, Makoto Iida, and Takeshi Naemura. Cross-connected networks for multi-task learning of detection and segmentation. pages 3636–3640. IEEE, 2019

  37. [37]

    On permutation- invariant neural networks.arXiv preprint arXiv:2403.17410, 2024

    Masanari Kimura, Ryotaro Shimizu, Yuki Hirakawa, Ryosuke Goto, and Yuki Saito. On permutation- invariant neural networks.arXiv preprint arXiv:2403.17410, 2024

  38. [38]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

  39. [39]

    Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation

    Jungbeom Lee, Eunji Kim, and Sungroh Yoon. Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

  40. [40]

    Scouter: Slot attention-based classifier for explainable image recognition

    Liangzhi Li, Bowen Wang, Manisha Verma, Yuta Nakashima, Ryo Kawasaki, and Hajime Nagahara. Scouter: Slot attention-based classifier for explainable image recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1046–1055, 2021

  41. [41]

    Exploring plain vision transformer backbones for object detection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. InEuropean Conference on Computer Vision, pages 280–296, 2022

  42. [42]

    Token activation map to visually explain multimodal llms

    Yi Li, Hualiang Wang, Xinpeng Ding, Haonan Wang, and Xiaomeng Li. Token activation map to visually explain multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 48–58, 2025

  43. [43]

    Pay attention to mlps.Advances in Neural Information Processing Systems, 34:9204–9215, 2021

    Hanxiao Liu, Zihang Dai, David So, and Quoc V Le. Pay attention to mlps.Advances in Neural Information Processing Systems, 34:9204–9215, 2021

  44. [44]

    Self-supervised learning of intertwined content and positional features for object detection.International Conference on Machine Learning, 267:39552–39567, 2025

    Kang Jun Liu, Masanori Suganuma, and Takayuki Okatani. Self-supervised learning of intertwined content and positional features for object detection.International Conference on Machine Learning, 267:39552–39567, 2025

  45. [45]

    An intriguing failing of convolutional neural networks and the CoordConv solution.Advances in Neural Information Processing Systems, 31, 2018

    Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution.Advances in Neural Information Processing Systems, 31, 2018

  46. [46]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021

  47. [47]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022

  48. [48]

    Object-centric learning with slot attention.Advances in Neural Information Processing Systems, 33:11525–11538, 2020

    Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention.Advances in Neural Information Processing Systems, 33:11525–11538, 2020

[49] Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, and Yun Fu. Image as set of points. International Conference on Learning Representations, 2023

[50] A. David Milner and Melvyn A. Goodale. The Visual Brain in Action. Oxford University Press, Oxford, 1995

[51] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In European Conference on Computer Vision, pages 728–755. Springer, 2022

[52] Mortimer Mishkin, Leslie G Ungerleider, and Kathleen A Macko. Object vision and spatial vision: two cortical pathways. Trends in Neurosciences, 6:414–417, 1983

[53] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016

[54] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024

[55] Bill Psomas, Ioannis Kakogeorgiou, Konstantinos Karantzalos, and Yannis Avrithis. Keep it SimPool: Who said supervised transformers suffer from attention deficit? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5350–5360, 2023

[56] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[57] Sai Saketh Rambhatla, Ishan Misra, Rama Chellappa, and Abhinav Shrivastava. MOST: Multiple object localization with self-supervised transformers for object discovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15823–15834, 2023

[58] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021

[59] Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, and Narendra Ahuja. Finding distributed object-centric properties in self-supervised transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

[60] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, 2017

[61] Maximilian Seitzer et al. Bridging the gap to real-world object-centric learning. International Conference on Learning Representations, 2023

[62] Cheng Shi, Yizhou Yu, and Sibei Yang. Vision transformers need more than registers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

[63] Oriane Siméoni, Gilles Puy, Huy V. Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. 2021

[64] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 27, 2014

[65] Gautam Singh, Fei Deng, and Sungjin Ahn. Illiterate DALL-E learns to compose. In International Conference on Learning Representations, 2022

[66] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015

[67] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016

[68] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. ResMLP: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):5314–5321, 2022

[69] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357, 2021

[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

[71] Bowen Wang, Liangzhi Li, Yuta Nakashima, and Hajime Nagahara. Learning bottleneck concepts in image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10962–10971, 2023

[72] Bowen Wang, Liangzhi Li, Jiahao Zhang, Yuta Nakashima, and Hajime Nagahara. Explainable image recognition via enhanced slot-attention based classifier. arXiv preprint arXiv:2407.05616, 2024

[73] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021

[74] Yangtao Wang, Xi Shen, Shell Xu Hu, Yuan Yuan, James L Crowley, and Dominique Vaufreydaz. Self-supervised transformers for unsupervised object discovery using normalized cut. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14543–14553, 2022

[75] Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12275–12284, 2020

[76] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22–31, 2021

[77] Yi-Fu Wu, Klaus Greff, Gamaleldin Fathy Elsayed, Michael Curtis Mozer, Thomas Kipf, and Sjoerd van Steenkiste. Inverted-attention transformers can learn object representations: Insights from slot attention. In Causal Representation Learning Workshop at NeurIPS, 2023

[78] Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. Audiovisual SlowFast networks for video recognition. arXiv preprint arXiv:2001.08740, 2020

[79] Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid Boussaid, and Dan Xu. Multi-class token transformer for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4310–4319, 2022

[80] Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, and Tao Mei. Dual vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10870–10882, 2023
