Recognition: unknown
Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Pith reviewed 2026-05-10 11:27 UTC · model grok-4.3
The pith
SEPatch3D accelerates ViT-based sparse multi-view 3D detectors up to 57% by dynamically adjusting patch sizes according to scene content while keeping detection accuracy comparable to baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SEPatch3D dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Spatiotemporal-aware Patch Size Selection assigns small patches to scenes containing nearby objects and large patches to background-dominated scenes. Informative Patch Selection then chooses informative patches for feature refinement, and Cross-Granularity Feature Enhancement injects fine-grained details into selected coarse patches. On the nuScenes and Argoverse 2 validation sets this produces up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the prior state-of-the-art ToC3D-faster while preserving comparable detection accuracy.
What carries the argument
Spatiotemporal-aware Patch Size Selection (SPSS), which assigns variable patch sizes based on scene content, together with Informative Patch Selection and Cross-Granularity Feature Enhancement to recover semantic details in the compressed token set.
If this is right
- Real-time multi-view 3D detection becomes feasible on hardware with limited compute budgets.
- Token compression can be made scene-adaptive instead of uniform without harming downstream 3D tasks.
- The same three-stage compression pattern can be applied to other ViT-based detectors that process multi-camera inputs.
Where Pith is reading between the lines
- The method may extend naturally to other dense prediction problems such as semantic segmentation or depth estimation that also rely on ViT backbones.
- Combining the patch-size logic with quantization or pruning could produce further speed-ups on the same models.
- The emphasis on preserving background context suggests that pure foreground-focused compression may be suboptimal for 3D scene understanding in general.
Load-bearing premise
The chosen patch sizes and detail-recovery steps will keep all information that the 3D detector actually needs for accurate object localization and classification.
What would settle it
An experiment on nuScenes showing that SEPatch3D drops mean average precision below the StreamPETR baseline when tested on scenes dominated by small or distant objects.
Figures
read the original abstract
Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to \textbf{57\%} faster inference than the StreamPETR baseline and \textbf{20\%} higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at https://github.com/Mingqj/SEPatch3D.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SEPatch3D, a token compression framework for accelerating ViT-based sparse multi-view 3D object detectors. It introduces three components: Spatiotemporal-aware Patch Size Selection (SPSS) that dynamically chooses patch sizes based on scene content (small patches for nearby objects, large for background), Informative Patch Selection (IPS) to refine selected patches, and Cross-Granularity Feature Enhancement (CGFE) to inject fine-grained details into coarse patches. On nuScenes and Argoverse 2 validation sets, it reports up to 57% faster inference than StreamPETR and 20% higher efficiency than ToC3D-faster while maintaining comparable mAP and NDS detection accuracy. Code is released publicly.
Significance. If the empirical results hold, the work offers a practical improvement for real-time multi-view 3D detection in autonomous driving by mitigating information loss in prior token pruning/merging approaches. Strengths include end-to-end latency measurements on consistent hardware, ablation tables that isolate the contribution of SPSS, IPS, and CGFE, and public code release, which aid reproducibility and allow independent verification of the speed-accuracy trade-off claims.
major comments (2)
- [§4.2, Table 2] §4.2 and Table 2: The mAP/NDS results are reported as single point estimates without error bars, standard deviations, or results across multiple random seeds. This weakens the claim that accuracy is 'comparable' to baselines, as it is unclear whether small observed differences fall within statistical noise.
- [§3.2] §3.2: The SPSS module relies on 'patch size selection thresholds' (noted as free parameters in the axiom ledger), but the manuscript does not specify how these thresholds are chosen, whether they are fixed across datasets, or how they are tuned. This affects reproducibility and the generality of the spatiotemporal-aware selection.
minor comments (2)
- [Figure 4] Figure 4: The efficiency-accuracy trade-off plot would be clearer with explicit annotation of the exact hardware (e.g., GPU model) and batch size used for all timing measurements.
- [§4.1] §4.1: The implementation details for the new modules (e.g., exact feature injection in CGFE) are high-level; pseudocode or a small diagram would improve clarity for readers attempting to reimplement.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help improve the clarity and rigor of our work. We address each major comment point by point below.
read point-by-point responses
-
Referee: [§4.2, Table 2] §4.2 and Table 2: The mAP/NDS results are reported as single point estimates without error bars, standard deviations, or results across multiple random seeds. This weakens the claim that accuracy is 'comparable' to baselines, as it is unclear whether small observed differences fall within statistical noise.
Authors: We acknowledge the value of statistical reporting for robustness. Our experiments followed the standard single-run protocol with a fixed seed used in prior works such as StreamPETR and ToC3D to ensure direct comparability. The accuracy differences remain consistent across our ablation studies. In the revised manuscript we will add results from three random seeds with mean and standard deviation in Table 2 and Section 4.2. revision: yes
-
Referee: [§3.2] §3.2: The SPSS module relies on 'patch size selection thresholds' (noted as free parameters in the axiom ledger), but the manuscript does not specify how these thresholds are chosen, whether they are fixed across datasets, or how they are tuned. This affects reproducibility and the generality of the spatiotemporal-aware selection.
Authors: The thresholds are set by analyzing the object-distance histogram on the nuScenes training set, using a fixed cutoff of 20 m to separate nearby objects (small patches) from background (large patches). The same values are applied unchanged to Argoverse 2. We will add the exact threshold values, the histogram-based derivation, and a note on cross-dataset usage in the revised Section 3.2. revision: yes
Circularity Check
No significant circularity; empirical proposal with independent validation
full rationale
The paper presents SEPatch3D as an empirical engineering contribution consisting of three new modules (SPSS, IPS, CGFE) whose value is demonstrated through direct runtime measurements and accuracy comparisons against external baselines (StreamPETR, ToC3D-faster) on nuScenes and Argoverse 2. No equations, predictions, or first-principles derivations appear that reduce by construction to fitted parameters, self-citations, or renamed inputs. Ablation tables isolate each component's contribution on held-out validation splits, and efficiency claims are end-to-end hardware measurements rather than analytic predictions. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Patch size selection thresholds
Reference graph
Works this paper leans on
-
[1]
Sachith Abeywickrama, Emadeldeen Eldele, Min Wu, Xi- aoli Li, and Chau Yuen. Entrope: Entropy-guided dynamic patch encoder for time series forecasting.arXiv preprint arXiv:2509.26157, 2025. 5
-
[2]
Token cropr: Faster vits for quite a few tasks
Benjamin Bergner, Christoph Lippert, and Aravindh Mahen- dran. Token cropr: Faster vits for quite a few tasks. InCVPR,
-
[3]
Flexivit: One model for all patch sizes
Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. Flexivit: One model for all patch sizes. In CVPR, 2023. 4, 1
2023
-
[4]
Multi-scale and token mergence: Make your vit more effi- cient.arXiv preprint arXiv:2306.04897, 2023
Zhe Bian, Zhe Wang, Wenqiang Han, and Kangping Wang. Multi-scale and token mergence: Make your vit more effi- cient.arXiv preprint arXiv:2306.04897, 2023. 3
-
[5]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. To- ken merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022. 3
work page internal anchor Pith review arXiv 2022
-
[6]
nuscenes: A multi- modal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InCVPR, 2020. 2, 5
2020
-
[7]
MMDetection3D: Open- MMLab next-generation platform for general 3D object de- tection, 2020
MMDetection3D Contributors. MMDetection3D: Open- MMLab next-generation platform for general 3D object de- tection, 2020. 5
2020
-
[8]
Heatvit: Hardware-efficient adaptive token pruning for vision transformers
Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Ken- neth Liu, Zhenglun Kong, Xin Meng, Zhengang Li, Xue Lin, Zhenman Fang, et al. Heatvit: Hardware-efficient adaptive token pruning for vision transformers. InHPCA, 2023. 3
2023
-
[9]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[10]
Eva-02: A visual representation for neon genesis.Image and Vision Computing, 2024
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- long Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.Image and Vision Computing, 2024. 4, 5
2024
-
[11]
Bhvit: Binarized hy- brid vision transformer
Tian Gao, Yu Zhang, Zhiyuan Zhang, Huajun Liu, Kaijie Yin, Chengzhong Xu, and Hui Kong. Bhvit: Binarized hy- brid vision transformer. InCVPR, 2025. 3
2025
-
[12]
Delving into se- quential patches for deepfake detection
Jiazhi Guan, Hang Zhou, Zhibin Hong, Errui Ding, Jingdong Wang, Chengbin Quan, and Youjian Zhao. Delving into se- quential patches for deepfake detection. InNeurIPS, 2022. 5
2022
-
[13]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR,
-
[14]
Open: Object-wise position embedding for multi-view 3d object de- tection
Jinghua Hou, Tong Wang, Xiaoqing Ye, Zhe Liu, Shi Gong, Xiao Tan, Errui Ding, Jingdong Wang, and Xiang Bai. Open: Object-wise position embedding for multi-view 3d object de- tection. InECCV, 2024. 1
2024
-
[15]
arXiv preprint arXiv:2203.17054 (2022)
Junjie Huang and Guan Huang. Bevdet4d: Exploit tempo- ral cues in multi-camera 3d object detection.arXiv preprint arXiv:2203.17054, 2022. 1
-
[16]
Categorical Reparameterization with Gumbel-Softmax
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax.arXiv preprint arXiv:1611.01144, 2016. 5
work page internal anchor Pith review arXiv 2016
-
[17]
Far3d: Expanding the horizon for surround-view 3d object detec- tion
Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, and Xiangyu Zhang. Far3d: Expanding the horizon for surround-view 3d object detec- tion. InAAAI, 2024. 6, 1
2024
-
[18]
Token fusion: Bridging the gap between token pruning and token merging
Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. Token fusion: Bridging the gap between token pruning and token merging. InWACV, 2024. 1
2024
-
[19]
Segment any- thing
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, and Wan-Yen Lo. Segment any- thing. InICCV, 2023. 7
2023
-
[20]
An energy and gpu-computation efficient backbone network for real-time object detection
Youngwan Lee, Joong-won Hwang, Sangrok Lee, Yuseok Bae, and Jongyoul Park. An energy and gpu-computation efficient backbone network for real-time object detection. In CVPR workshops, 2019. 6
2019
-
[21]
Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo
Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, and Zeming Li. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In AAAI, 2023. 6
2023
-
[22]
Bevdepth: Acquisition of reliable depth for multi-view 3d object detec- tion
Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detec- tion. InAAAI, 2023. 1
2023
-
[23]
Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. InECCV, 2022. 1
2022
-
[24]
Ashsr: Enhanc- ing query-based occupancy prediction via anti-occlusion sampling and hard sample reweighting.Neurocomputing,
Zhihao Li, Shanshan Zhang, and Jian Yang. Ashsr: Enhanc- ing query-based occupancy prediction via anti-occlusion sampling and hard sample reweighting.Neurocomputing,
-
[25]
Not all patches are what you need: Expediting vision transformers via token reorganizations
Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganiza- tions.arXiv preprint arXiv:2202.07800, 2022. 2
-
[26]
arXiv preprint arXiv:2211.10581 (2022)
Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d: Multi-view 3d object detec- tion with sparse spatial-temporal fusion.arXiv preprint arXiv:2211.10581, 2022. 1, 3, 6
-
[27]
Sparse4d v2: Recurrent temporal fusion with sparse model.arXiv preprint arXiv:2305.14018, 2023
Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d v2: Recurrent temporal fusion with sparse model.arXiv preprint arXiv:2305.14018, 2023. 6, 7
-
[28]
Not all tokens are what you need for pretraining
Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen, et al. Not all tokens are what you need for pretraining. InNeurIPS, 2024. 1
2024
-
[29]
Ray denoising: Depth-aware hard negative sampling for multi- view 3d object detection
Feng Liu, Tengteng Huang, Qianjing Zhang, Haotian Yao, Chi Zhang, Fang Wan, Qixiang Ye, and Yanzhao Zhou. Ray denoising: Depth-aware hard negative sampling for multi- view 3d object detection. InECCV, 2024. 1
2024
-
[30]
Sparsebev: High-performance sparse 3d object detec- tion from multi-camera videos
Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. Sparsebev: High-performance sparse 3d object detec- tion from multi-camera videos. InICCV, 2023. 3, 6
2023
-
[31]
Petr: Position embedding transformation for multi-view 3d object detection
Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. InECCV, 2022. 3, 6
2022
-
[32]
Revisiting token pruning for object detection and instance segmentation
Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Can- nici, and Davide Scaramuzza. Revisiting token pruning for object detection and instance segmentation. InWACV, 2024. 1
2024
-
[33]
Time will tell: New outlooks and a baseline for temporal multi- view 3d object detection
Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris M Kitani, Masayoshi Tomizuka, and Wei Zhan. Time will tell: New outlooks and a baseline for temporal multi- view 3d object detection. InICLR, 2022. 6
2022
-
[34]
Pytorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. 5
2019
-
[35]
Dynamicvit: Efficient vision transformers with dynamic token sparsification
Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. InNeurIPS,
-
[36]
Shunted self-attention via multi-scale token aggregation
Sucheng Ren, Daquan Zhou, Shengfeng He, Jiashi Feng, and Xinchao Wang. Shunted self-attention via multi-scale token aggregation. InCVPR, 2022. 3
2022
-
[37]
Learn- ing to merge tokens in vision transformers.arXiv preprint arXiv:2202.12015, 2022
Cedric Renggli, Andr ´e Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, and Carlos Riquelme. Learn- ing to merge tokens in vision transformers.arXiv preprint arXiv:2202.12015, 2022. 3
-
[38]
Byungseok Roh, JaeWoong Shin, Wuhyun Shin, and Sae- hoon Kim. Sparse detr: Efficient end-to-end object detection with learnable sparsity.arXiv preprint arXiv:2111.14330,
-
[39]
3dppe: 3d point positional encoding for transformer-based multi-camera 3d object detection
Changyong Shu, Jiajun Deng, Fisher Yu, and Yifan Liu. 3dppe: 3d point positional encoding for transformer-based multi-camera 3d object detection. InICCV, 2023. 3, 6
2023
-
[40]
Linear regres- sion.Wiley Interdisciplinary Reviews: Computational Statis- tics, 2012
Xiaogang Su, Xin Yan, and Chih-Ling Tsai. Linear regres- sion.Wiley Interdisciplinary Reviews: Computational Statis- tics, 2012. 4
2012
-
[41]
Pv-mm3d: Point-voxel parallel dual-stream framework with dual-attention region adaptive fusion for multimodal 3d object detection.Infor- mation Fusion, 2025
Baotong Wang, Chenxing Xia, Xiuju Gao, Yuan Yang, Bin Ge, Kuan-Ching Li, and Yan Zhang. Pv-mm3d: Point-voxel parallel dual-stream framework with dual-attention region adaptive fusion for multimodal 3d object detection.Infor- mation Fusion, 2025. 1
2025
-
[42]
Exploring object-centric temporal modeling for efficient multi-view 3d object detection
Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xi- angyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. InICCV, 2023. 2, 3, 4, 5, 6, 7, 8, 1
2023
-
[43]
Detr3d: 3d object detection from multi-view images via 3d-to-2d queries
Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. InRL, 2022. 1, 3, 6, 7
2022
-
[44]
Argoverse 2: Next generation datasets for self-driving perception and fore- casting
Benjamin Wilson, William Qi, Tanmay Agarwal, John Lam- bert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Rat- nesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and fore- casting. InNeurIPS, 2021. 2, 5
2021
-
[45]
Cvt: Introducing con- volutions to vision transformers
Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing con- volutions to vision transformers. InICCV, 2021. 3
2021
-
[46]
Yuan Wu, Zhiqiang Yan, Yigong Zhang, Xiang Li, and Jian Yang. See through the dark: Learning illumination-affined representations for nighttime occupancy prediction.arXiv preprint arXiv:2505.20641, 2025. 1
-
[47]
Accelerate 3d object detection models via zero- shot attention key pruning
Lizhen Xu, Xiuxiu Bai, Xiaojun Jia, Jianwu Fang, and Shan- min Pang. Accelerate 3d object detection models via zero- shot attention key pruning. InICCV, 2025. 3, 6
2025
-
[48]
Uncertainty guided test-time training for face forgery detec- tion.Computer Vision and Image Understanding, 2024
Pengxiang Xu, Yang He, Jian Yang, and Shanshan Zhang. Uncertainty guided test-time training for face forgery detec- tion.Computer Vision and Image Understanding, 2024. 5
2024
-
[49]
Evo-vit: Slow-fast token evolution for dynamic vision transformer
Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-vit: Slow-fast token evolution for dynamic vision transformer. InAAAI, 2022. 2
2022
-
[50]
Tokens-to-token vit: Training vision transformers from scratch on imagenet
Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. InICCV, 2021. 3
2021
-
[51]
Fanhu Zeng, Deli Yu, Zhenglun Kong, and Hao Tang. Token transforming: A unified and training-free token compres- sion framework for vision transformer acceleration.arXiv preprint arXiv:2506.05709, 2025. 1
-
[52]
Not all tokens are equal: Human-centric visual analysis via token clustering transformer
Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, and Xiaogang Wang. Not all tokens are equal: Human-centric visual analysis via token clustering transformer. InCVPR, 2022. 3
2022
-
[53]
Tcformer: visual recognition via token clustering transformer.IEEE TPAMI, 2024
Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, and Xiaogang Wang. Tcformer: visual recognition via token clustering transformer.IEEE TPAMI, 2024. 1
2024
-
[54]
Make your vit-based multi-view 3d detectors faster via token com- pression
Dingyuan Zhang, Dingkang Liang, Zichang Tan, Xiaoqing Ye, Cheng Zhang, Jingdong Wang, and Xiang Bai. Make your vit-based multi-view 3d detectors faster via token com- pression. InECCV, 2024. 2, 3, 5, 6, 7
2024
-
[55]
Topformer: Token pyramid transformer for mobile semantic segmentation
Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, Xinggang Wang, Wenyu Liu, Gang Yu, and Chunhua Shen. Topformer: Token pyramid transformer for mobile semantic segmentation. InCVPR, 2022. 3
2022
-
[56]
Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longx- iang Gao, Wenbo Xu, et al. Image recognition with on- line lightweight vision transformer: a survey.arXiv preprint arXiv:2505.03113, 2025. 2
-
[57]
World4drive: End-to-end au- tonomous driving via intention-aware physical latent world model
Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end au- tonomous driving via intention-aware physical latent world model. InICCV, 2025. 3
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.