arxiv: 2604.14563 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

Mingqian Ji , Shanshan Zhang , Jian Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 11:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords token compressionVision Transformermulti-view 3D detectionpatch size selectionsparse detectorsnuScenesArgoverse 2

0 comments

The pith

SEPatch3D accelerates ViT-based sparse multi-view 3D detectors up to 57% by dynamically adjusting patch sizes according to scene content while keeping detection accuracy comparable to baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that common token compression approaches for Vision Transformer detectors in 3D scenes often remove useful background information and fine details, which hurts detection quality. It introduces SEPatch3D to fix this by choosing smaller patches in crowded or nearby-object scenes and larger ones in empty background areas. Two extra steps then pick the most useful patches and restore lost details across different patch sizes. On standard benchmarks this yields large speed gains without accuracy drops, which matters for making high-performance 3D perception fast enough for real-time use.

Core claim

SEPatch3D dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Spatiotemporal-aware Patch Size Selection assigns small patches to scenes containing nearby objects and large patches to background-dominated scenes. Informative Patch Selection then chooses informative patches for feature refinement, and Cross-Granularity Feature Enhancement injects fine-grained details into selected coarse patches. On the nuScenes and Argoverse 2 validation sets this produces up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the prior state-of-the-art ToC3D-faster while preserving comparable detection accuracy.

What carries the argument

Spatiotemporal-aware Patch Size Selection (SPSS), which assigns variable patch sizes based on scene content, together with Informative Patch Selection and Cross-Granularity Feature Enhancement to recover semantic details in the compressed token set.

If this is right

Real-time multi-view 3D detection becomes feasible on hardware with limited compute budgets.
Token compression can be made scene-adaptive instead of uniform without harming downstream 3D tasks.
The same three-stage compression pattern can be applied to other ViT-based detectors that process multi-camera inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may extend naturally to other dense prediction problems such as semantic segmentation or depth estimation that also rely on ViT backbones.
Combining the patch-size logic with quantization or pruning could produce further speed-ups on the same models.
The emphasis on preserving background context suggests that pure foreground-focused compression may be suboptimal for 3D scene understanding in general.

Load-bearing premise

The chosen patch sizes and detail-recovery steps will keep all information that the 3D detector actually needs for accurate object localization and classification.

What would settle it

An experiment on nuScenes showing that SEPatch3D drops mean average precision below the StreamPETR baseline when tested on scenes dominated by small or distant objects.

Figures

Figures reproduced from arXiv: 2604.14563 by Jian Yang, Mingqian Ji, Shanshan Zhang.

**Figure 2.** Figure 2: Accuracy-latency under an input resolution of 640 [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of our method. It consists of two key stages: (1) Dynamic Dual Patch Embedding (Sec. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Illustration of the proposed Informative Patch Selection [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 6.** Figure 6: Budget-aware analysis of accuracy-latency under the [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 1.** Figure 1: Background patches contain useful hard negative information for 3D detection. Impact of patch size diversity on detection performance in response to: “why do we only use two patch sizes instead of more?”. As shown in [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗

**Figure 9.** Figure 9: Impact of patch size diversity on detection performance. [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

**Figure 12.** Figure 12: More visualizations of informative regions and feature [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗

**Figure 13.** Figure 13: Visualization of detection results between StreamPETR [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗

**Figure 11.** Figure 11: Visualization of the patch size selection process in the [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] view at source ↗

read the original abstract

Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to \textbf{57\%} faster inference than the StreamPETR baseline and \textbf{20\%} higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at https://github.com/Mingqj/SEPatch3D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SEPatch3D, a token compression framework for accelerating ViT-based sparse multi-view 3D object detectors. It introduces three components: Spatiotemporal-aware Patch Size Selection (SPSS) that dynamically chooses patch sizes based on scene content (small patches for nearby objects, large for background), Informative Patch Selection (IPS) to refine selected patches, and Cross-Granularity Feature Enhancement (CGFE) to inject fine-grained details into coarse patches. On nuScenes and Argoverse 2 validation sets, it reports up to 57% faster inference than StreamPETR and 20% higher efficiency than ToC3D-faster while maintaining comparable mAP and NDS detection accuracy. Code is released publicly.

Significance. If the empirical results hold, the work offers a practical improvement for real-time multi-view 3D detection in autonomous driving by mitigating information loss in prior token pruning/merging approaches. Strengths include end-to-end latency measurements on consistent hardware, ablation tables that isolate the contribution of SPSS, IPS, and CGFE, and public code release, which aid reproducibility and allow independent verification of the speed-accuracy trade-off claims.

major comments (2)

[§4.2, Table 2] §4.2 and Table 2: The mAP/NDS results are reported as single point estimates without error bars, standard deviations, or results across multiple random seeds. This weakens the claim that accuracy is 'comparable' to baselines, as it is unclear whether small observed differences fall within statistical noise.
[§3.2] §3.2: The SPSS module relies on 'patch size selection thresholds' (noted as free parameters in the axiom ledger), but the manuscript does not specify how these thresholds are chosen, whether they are fixed across datasets, or how they are tuned. This affects reproducibility and the generality of the spatiotemporal-aware selection.

minor comments (2)

[Figure 4] Figure 4: The efficiency-accuracy trade-off plot would be clearer with explicit annotation of the exact hardware (e.g., GPU model) and batch size used for all timing measurements.
[§4.1] §4.1: The implementation details for the new modules (e.g., exact feature injection in CGFE) are high-level; pseudocode or a small diagram would improve clarity for readers attempting to reimplement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and rigor of our work. We address each major comment point by point below.

read point-by-point responses

Referee: [§4.2, Table 2] §4.2 and Table 2: The mAP/NDS results are reported as single point estimates without error bars, standard deviations, or results across multiple random seeds. This weakens the claim that accuracy is 'comparable' to baselines, as it is unclear whether small observed differences fall within statistical noise.

Authors: We acknowledge the value of statistical reporting for robustness. Our experiments followed the standard single-run protocol with a fixed seed used in prior works such as StreamPETR and ToC3D to ensure direct comparability. The accuracy differences remain consistent across our ablation studies. In the revised manuscript we will add results from three random seeds with mean and standard deviation in Table 2 and Section 4.2. revision: yes
Referee: [§3.2] §3.2: The SPSS module relies on 'patch size selection thresholds' (noted as free parameters in the axiom ledger), but the manuscript does not specify how these thresholds are chosen, whether they are fixed across datasets, or how they are tuned. This affects reproducibility and the generality of the spatiotemporal-aware selection.

Authors: The thresholds are set by analyzing the object-distance histogram on the nuScenes training set, using a fixed cutoff of 20 m to separate nearby objects (small patches) from background (large patches). The same values are applied unchanged to Argoverse 2. We will add the exact threshold values, the histogram-based derivation, and a note on cross-dataset usage in the revised Section 3.2. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical proposal with independent validation

full rationale

The paper presents SEPatch3D as an empirical engineering contribution consisting of three new modules (SPSS, IPS, CGFE) whose value is demonstrated through direct runtime measurements and accuracy comparisons against external baselines (StreamPETR, ToC3D-faster) on nuScenes and Argoverse 2. No equations, predictions, or first-principles derivations appear that reduce by construction to fitted parameters, self-citations, or renamed inputs. Ablation tables isolate each component's contribution on held-out validation splits, and efficiency claims are end-to-end hardware measurements rather than analytic predictions. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim depends on empirical validation of the new modules rather than formal derivation; any patch-size decision thresholds or selection criteria are likely tuned parameters whose values are not specified in the abstract.

free parameters (1)

Patch size selection thresholds
The decision rules in SPSS for assigning small vs. large patches based on scene content are not derived from first principles and require empirical tuning.

pith-pipeline@v0.9.0 · 5565 in / 1131 out tokens · 55288 ms · 2026-05-10T11:27:22.744277+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 14 canonical work pages · 3 internal anchors

[1]

Entrope: Entropy-guided dynamic patch encoder for time series forecasting.arXiv preprint arXiv:2509.26157, 2025

Sachith Abeywickrama, Emadeldeen Eldele, Min Wu, Xi- aoli Li, and Chau Yuen. Entrope: Entropy-guided dynamic patch encoder for time series forecasting.arXiv preprint arXiv:2509.26157, 2025. 5

work page arXiv 2025
[2]

Token cropr: Faster vits for quite a few tasks

Benjamin Bergner, Christoph Lippert, and Aravindh Mahen- dran. Token cropr: Faster vits for quite a few tasks. InCVPR,
[3]

Flexivit: One model for all patch sizes

Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. Flexivit: One model for all patch sizes. In CVPR, 2023. 4, 1

2023
[4]

Multi-scale and token mergence: Make your vit more effi- cient.arXiv preprint arXiv:2306.04897, 2023

Zhe Bian, Zhe Wang, Wenqiang Han, and Kangping Wang. Multi-scale and token mergence: Make your vit more effi- cient.arXiv preprint arXiv:2306.04897, 2023. 3

work page arXiv 2023
[5]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. To- ken merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022. 3

work page internal anchor Pith review arXiv 2022
[6]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InCVPR, 2020. 2, 5

2020
[7]

MMDetection3D: Open- MMLab next-generation platform for general 3D object de- tection, 2020

MMDetection3D Contributors. MMDetection3D: Open- MMLab next-generation platform for general 3D object de- tection, 2020. 5

2020
[8]

Heatvit: Hardware-efficient adaptive token pruning for vision transformers

Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Ken- neth Liu, Zhenglun Kong, Xin Meng, Zhengang Li, Xue Lin, Zhenman Fang, et al. Heatvit: Hardware-efficient adaptive token pruning for vision transformers. InHPCA, 2023. 3

2023
[9]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2010
[10]

Eva-02: A visual representation for neon genesis.Image and Vision Computing, 2024

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- long Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.Image and Vision Computing, 2024. 4, 5

2024
[11]

Bhvit: Binarized hy- brid vision transformer

Tian Gao, Yu Zhang, Zhiyuan Zhang, Huajun Liu, Kaijie Yin, Chengzhong Xu, and Hui Kong. Bhvit: Binarized hy- brid vision transformer. InCVPR, 2025. 3

2025
[12]

Delving into se- quential patches for deepfake detection

Jiazhi Guan, Hang Zhou, Zhibin Hong, Errui Ding, Jingdong Wang, Chengbin Quan, and Youjian Zhao. Delving into se- quential patches for deepfake detection. InNeurIPS, 2022. 5

2022
[13]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR,
[14]

Open: Object-wise position embedding for multi-view 3d object de- tection

Jinghua Hou, Tong Wang, Xiaoqing Ye, Zhe Liu, Shi Gong, Xiao Tan, Errui Ding, Jingdong Wang, and Xiang Bai. Open: Object-wise position embedding for multi-view 3d object de- tection. InECCV, 2024. 1

2024
[15]

arXiv preprint arXiv:2203.17054 (2022)

Junjie Huang and Guan Huang. Bevdet4d: Exploit tempo- ral cues in multi-camera 3d object detection.arXiv preprint arXiv:2203.17054, 2022. 1

work page arXiv 2022
[16]

Categorical Reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax.arXiv preprint arXiv:1611.01144, 2016. 5

work page internal anchor Pith review arXiv 2016
[17]

Far3d: Expanding the horizon for surround-view 3d object detec- tion

Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, and Xiangyu Zhang. Far3d: Expanding the horizon for surround-view 3d object detec- tion. InAAAI, 2024. 6, 1

2024
[18]

Token fusion: Bridging the gap between token pruning and token merging

Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. Token fusion: Bridging the gap between token pruning and token merging. InWACV, 2024. 1

2024
[19]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, and Wan-Yen Lo. Segment any- thing. InICCV, 2023. 7

2023
[20]

An energy and gpu-computation efficient backbone network for real-time object detection

Youngwan Lee, Joong-won Hwang, Sangrok Lee, Yuseok Bae, and Jongyoul Park. An energy and gpu-computation efficient backbone network for real-time object detection. In CVPR workshops, 2019. 6

2019
[21]

Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo

Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, and Zeming Li. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In AAAI, 2023. 6

2023
[22]

Bevdepth: Acquisition of reliable depth for multi-view 3d object detec- tion

Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detec- tion. InAAAI, 2023. 1

2023
[23]

Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. InECCV, 2022. 1

2022
[24]

Ashsr: Enhanc- ing query-based occupancy prediction via anti-occlusion sampling and hard sample reweighting.Neurocomputing,

Zhihao Li, Shanshan Zhang, and Jian Yang. Ashsr: Enhanc- ing query-based occupancy prediction via anti-occlusion sampling and hard sample reweighting.Neurocomputing,
[25]

Not all patches are what you need: Expediting vision transformers via token reorganizations

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganiza- tions.arXiv preprint arXiv:2202.07800, 2022. 2

work page arXiv 2022
[26]

arXiv preprint arXiv:2211.10581 (2022)

Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d: Multi-view 3d object detec- tion with sparse spatial-temporal fusion.arXiv preprint arXiv:2211.10581, 2022. 1, 3, 6

work page arXiv 2022
[27]

Sparse4d v2: Recurrent temporal fusion with sparse model.arXiv preprint arXiv:2305.14018, 2023

Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d v2: Recurrent temporal fusion with sparse model.arXiv preprint arXiv:2305.14018, 2023. 6, 7

work page arXiv 2023
[28]

Not all tokens are what you need for pretraining

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen, et al. Not all tokens are what you need for pretraining. InNeurIPS, 2024. 1

2024
[29]

Ray denoising: Depth-aware hard negative sampling for multi- view 3d object detection

Feng Liu, Tengteng Huang, Qianjing Zhang, Haotian Yao, Chi Zhang, Fang Wan, Qixiang Ye, and Yanzhao Zhou. Ray denoising: Depth-aware hard negative sampling for multi- view 3d object detection. InECCV, 2024. 1

2024
[30]

Sparsebev: High-performance sparse 3d object detec- tion from multi-camera videos

Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. Sparsebev: High-performance sparse 3d object detec- tion from multi-camera videos. InICCV, 2023. 3, 6

2023
[31]

Petr: Position embedding transformation for multi-view 3d object detection

Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. InECCV, 2022. 3, 6

2022
[32]

Revisiting token pruning for object detection and instance segmentation

Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Can- nici, and Davide Scaramuzza. Revisiting token pruning for object detection and instance segmentation. InWACV, 2024. 1

2024
[33]

Time will tell: New outlooks and a baseline for temporal multi- view 3d object detection

Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris M Kitani, Masayoshi Tomizuka, and Wei Zhan. Time will tell: New outlooks and a baseline for temporal multi- view 3d object detection. InICLR, 2022. 6

2022
[34]

Pytorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. 5

2019
[35]

Dynamicvit: Efficient vision transformers with dynamic token sparsification

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. InNeurIPS,
[36]

Shunted self-attention via multi-scale token aggregation

Sucheng Ren, Daquan Zhou, Shengfeng He, Jiashi Feng, and Xinchao Wang. Shunted self-attention via multi-scale token aggregation. InCVPR, 2022. 3

2022
[37]

Learn- ing to merge tokens in vision transformers.arXiv preprint arXiv:2202.12015, 2022

Cedric Renggli, Andr ´e Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, and Carlos Riquelme. Learn- ing to merge tokens in vision transformers.arXiv preprint arXiv:2202.12015, 2022. 3

work page arXiv 2022
[38]

Sparse detr: Efficient end-to-end object detection with learnable sparsity.arXiv preprint arXiv:2111.14330, 2021

Byungseok Roh, JaeWoong Shin, Wuhyun Shin, and Sae- hoon Kim. Sparse detr: Efficient end-to-end object detection with learnable sparsity.arXiv preprint arXiv:2111.14330,

work page arXiv
[39]

3dppe: 3d point positional encoding for transformer-based multi-camera 3d object detection

Changyong Shu, Jiajun Deng, Fisher Yu, and Yifan Liu. 3dppe: 3d point positional encoding for transformer-based multi-camera 3d object detection. InICCV, 2023. 3, 6

2023
[40]

Linear regres- sion.Wiley Interdisciplinary Reviews: Computational Statis- tics, 2012

Xiaogang Su, Xin Yan, and Chih-Ling Tsai. Linear regres- sion.Wiley Interdisciplinary Reviews: Computational Statis- tics, 2012. 4

2012
[41]

Pv-mm3d: Point-voxel parallel dual-stream framework with dual-attention region adaptive fusion for multimodal 3d object detection.Infor- mation Fusion, 2025

Baotong Wang, Chenxing Xia, Xiuju Gao, Yuan Yang, Bin Ge, Kuan-Ching Li, and Yan Zhang. Pv-mm3d: Point-voxel parallel dual-stream framework with dual-attention region adaptive fusion for multimodal 3d object detection.Infor- mation Fusion, 2025. 1

2025
[42]

Exploring object-centric temporal modeling for efficient multi-view 3d object detection

Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xi- angyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. InICCV, 2023. 2, 3, 4, 5, 6, 7, 8, 1

2023
[43]

Detr3d: 3d object detection from multi-view images via 3d-to-2d queries

Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. InRL, 2022. 1, 3, 6, 7

2022
[44]

Argoverse 2: Next generation datasets for self-driving perception and fore- casting

Benjamin Wilson, William Qi, Tanmay Agarwal, John Lam- bert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Rat- nesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and fore- casting. InNeurIPS, 2021. 2, 5

2021
[45]

Cvt: Introducing con- volutions to vision transformers

Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing con- volutions to vision transformers. InICCV, 2021. 3

2021
[46]

See through the dark: Learning illumination-affined representations for nighttime occupancy prediction.arXiv preprint arXiv:2505.20641, 2025

Yuan Wu, Zhiqiang Yan, Yigong Zhang, Xiang Li, and Jian Yang. See through the dark: Learning illumination-affined representations for nighttime occupancy prediction.arXiv preprint arXiv:2505.20641, 2025. 1

work page arXiv 2025
[47]

Accelerate 3d object detection models via zero- shot attention key pruning

Lizhen Xu, Xiuxiu Bai, Xiaojun Jia, Jianwu Fang, and Shan- min Pang. Accelerate 3d object detection models via zero- shot attention key pruning. InICCV, 2025. 3, 6

2025
[48]

Uncertainty guided test-time training for face forgery detec- tion.Computer Vision and Image Understanding, 2024

Pengxiang Xu, Yang He, Jian Yang, and Shanshan Zhang. Uncertainty guided test-time training for face forgery detec- tion.Computer Vision and Image Understanding, 2024. 5

2024
[49]

Evo-vit: Slow-fast token evolution for dynamic vision transformer

Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-vit: Slow-fast token evolution for dynamic vision transformer. InAAAI, 2022. 2

2022
[50]

Tokens-to-token vit: Training vision transformers from scratch on imagenet

Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. InICCV, 2021. 3

2021
[51]

Token transforming: A unified and training-free token compres- sion framework for vision transformer acceleration.arXiv preprint arXiv:2506.05709, 2025

Fanhu Zeng, Deli Yu, Zhenglun Kong, and Hao Tang. Token transforming: A unified and training-free token compres- sion framework for vision transformer acceleration.arXiv preprint arXiv:2506.05709, 2025. 1

work page arXiv 2025
[52]

Not all tokens are equal: Human-centric visual analysis via token clustering transformer

Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, and Xiaogang Wang. Not all tokens are equal: Human-centric visual analysis via token clustering transformer. InCVPR, 2022. 3

2022
[53]

Tcformer: visual recognition via token clustering transformer.IEEE TPAMI, 2024

Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, and Xiaogang Wang. Tcformer: visual recognition via token clustering transformer.IEEE TPAMI, 2024. 1

2024
[54]

Make your vit-based multi-view 3d detectors faster via token com- pression

Dingyuan Zhang, Dingkang Liang, Zichang Tan, Xiaoqing Ye, Cheng Zhang, Jingdong Wang, and Xiang Bai. Make your vit-based multi-view 3d detectors faster via token com- pression. InECCV, 2024. 2, 3, 5, 6, 7

2024
[55]

Topformer: Token pyramid transformer for mobile semantic segmentation

Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, Xinggang Wang, Wenyu Liu, Gang Yu, and Chunhua Shen. Topformer: Token pyramid transformer for mobile semantic segmentation. InCVPR, 2022. 3

2022
[56]

Image recognition with on- line lightweight vision transformer: a survey.arXiv preprint arXiv:2505.03113, 2025

Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longx- iang Gao, Wenbo Xu, et al. Image recognition with on- line lightweight vision transformer: a survey.arXiv preprint arXiv:2505.03113, 2025. 2

work page arXiv 2025
[57]

World4drive: End-to-end au- tonomous driving via intention-aware physical latent world model

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end au- tonomous driving via intention-aware physical latent world model. InICCV, 2025. 3

2025