An Efficient Token Compression Framework for Visual Object Tracking
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 01:22 UTC · model grok-4.3
The pith
Compressing historical template tokens lets visual trackers use more frames at lower compute cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ETCTrack first runs an Adaptive Token Compressor that learns to discard redundant visual tokens from multiple past templates, yielding a small set of highly discriminative template tokens; these tokens then enter a Hierarchical Interaction Encoder that performs layered cross-attention with the search-frame features, producing refined search representations that support accurate target localization.
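The compress-then-interact pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scoring vector, the shapes, and the plain top-k rule are all assumptions standing in for ETCTrack's learned Adaptive Token Compressor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 4 historical templates, 64 tokens each, 32-dim features.
# All names here are illustrative, not the paper's actual code.
tokens = rng.standard_normal((4 * 64, 32))   # flattened template tokens
w_score = rng.standard_normal(32)            # stand-in for a learned scoring vector

# Compression sketch: score every token, keep the top 40%
# (a 60% reduction, matching the ratio reported for ETCTrack-B224).
scores = tokens @ w_score
keep = int(0.4 * tokens.shape[0])
kept_idx = np.argsort(scores)[-keep:]
compressed = tokens[kept_idx]

print(tokens.shape, compressed.shape)  # (256, 32) (102, 32)
```

The compressed set would then enter the interaction stage in place of the full 256 template tokens, which is where the compute saving comes from.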
What carries the argument
Adaptive Token Compressor that dynamically filters redundant visual tokens from historical templates before deep interaction with search features.
If this is right
- Trackers can safely increase the number of historical templates without the quadratic rise in attention cost that the extra template tokens would otherwise incur.
- The refined search features produced after compression still support precise bounding-box regression.
- Overall multiply-accumulate operations drop 21.4 percent on a 224-resolution backbone with only a 0.4 percent accuracy trade-off.
- The same compression step can be inserted into other multi-template Transformer trackers that currently suffer token explosion.
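The quadratic-cost point in the first bullet can be made concrete with a toy model. The numbers below are illustrative only: attention cost is modeled as the square of the total token count, so the ratio it prints is an attention-only figure, not the paper's 21.4 percent, which covers the whole network.

```python
# Toy attention-cost model: self-attention MACs scale with the square of the
# total token count (template tokens + search tokens). Shapes are assumptions.
def attn_cost(n_template: int, n_search: int) -> int:
    n = n_template + n_search
    return n * n

n_search = 256
n_template_full = 4 * 64                       # four uncompressed templates
n_template_comp = int(0.4 * n_template_full)   # after a 60% token reduction

full = attn_cost(n_template_full, n_search)
comp = attn_cost(n_template_comp, n_search)
print(f"attention cost ratio: {comp / full:.2f}")  # ~0.49
```

Even under this crude model, halving-plus the attention cost while the rest of the network is untouched is consistent with a whole-network MAC reduction in the 20-percent range.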
Where Pith is reading between the lines
- The same token-filtering idea may transfer to other long-sequence vision tasks such as video action recognition or multi-object tracking where historical frames also create token overload.
- In latency-sensitive settings the reduced MAC count could translate directly into higher sustained frame rates on mobile or embedded devices.
- Because the compressor is learned rather than hand-crafted, retraining it on domain-specific data could further tighten the accuracy-efficiency frontier.
Load-bearing premise
Redundant visual tokens can be removed from historical templates without discarding the information needed to distinguish the target from distractors or background.
What would settle it
Run the method and the best uncompressed baseline on a new benchmark containing frequent heavy occlusion and visually similar distractors; if the compressed version shows a success-rate drop larger than 2-3 percent while the baseline does not, the central claim fails.
Figures
Original abstract
Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined template tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. Refined search features ensure subsequent precise target localization. Experiments on seven benchmarks demonstrate that our method outperforms current state-of-the-art trackers. ETCTrack-B224 reduces the number of template tokens by 60%, leading to a 21.4% reduction in MACs with only a 0.4% drop in accuracy. The source code are available at https://github.com/PJD-WJ/ETCTrack.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ETCTrack, a compress-then-interact framework for Transformer-based visual object tracking. It introduces an Adaptive Token Compressor that dynamically filters redundant tokens from multiple historical template frames into a compact representation, followed by a Hierarchical Interaction Encoder that performs deep adaptive interactions between the compressed templates and search-region features. Experiments across seven benchmarks report that the method outperforms current state-of-the-art trackers; specifically, ETCTrack-B224 achieves a 60% reduction in template tokens, a 21.4% reduction in MACs, and only a 0.4% drop in accuracy relative to an uncompressed baseline.
Significance. If the reported efficiency-accuracy trade-off holds under rigorous controls, the work offers a practical route to scaling the number of historical templates in Transformer trackers without incurring quadratic cost or performance degradation. The learned compression approach, rather than handcrafted rules, could generalize to other token-heavy vision tasks. The public code release strengthens reproducibility.
major comments (2)
- [§5, Table 2] §5 (Experiments), Table 2 and the main results paragraph: the claim that ETCTrack 'outperforms current state-of-the-art trackers' is difficult to evaluate because the paper does not state whether the listed baselines (e.g., MixFormer, OSTrack) were re-run with the same number of historical template frames that ETCTrack uses before compression. If baselines operate with fewer templates, part of the reported gain may stem from increased temporal context rather than the compression mechanism itself.
- [§4.1] §4.1 (Adaptive Token Compressor): the description of the token-selection process relies on learned importance scores, yet no ablation isolates the contribution of the compressor versus the Hierarchical Interaction Encoder. Without a controlled experiment that replaces the compressor with random or uniform token subsampling while keeping the encoder fixed, it remains unclear whether the 0.4% accuracy drop is truly minimal or whether the compressor is discarding task-critical tokens that the encoder later compensates for.
minor comments (3)
- [Abstract] Abstract: 'The source code are available' contains a subject-verb agreement error; should read 'The source code is available'.
- [§3] §3 (Method overview): the notation for the compressed token set T' is introduced without an explicit equation relating it to the original token set T; adding a compact equation (e.g., T' = f_comp(T)) would improve readability.
- [Figure 3] Figure 3 (qualitative results): the caption does not indicate whether the visualized attention maps are from the compressed or uncompressed model, making it hard to attribute the improved localization to the proposed modules.
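The compact equation requested in the second minor comment could take roughly the following form; the symbols f_comp, theta, and rho are illustrative placeholders, not notation from the paper:

```latex
% T  \in \mathbb{R}^{N \times d} : original historical template tokens
% T' \in \mathbb{R}^{M \times d} : compressed template tokens, M < N
T' = f_{\mathrm{comp}}(T;\, \theta), \qquad M = \lfloor \rho N \rfloor, \quad \rho \approx 0.4
```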
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with clarifications and commitments to revisions that strengthen the manuscript without misrepresenting our experimental setup.
Point-by-point responses
-
Referee: [§5, Table 2] §5 (Experiments), Table 2 and the main results paragraph: the claim that ETCTrack 'outperforms current state-of-the-art trackers' is difficult to evaluate because the paper does not state whether the listed baselines (e.g., MixFormer, OSTrack) were re-run with the same number of historical template frames that ETCTrack uses before compression. If baselines operate with fewer templates, part of the reported gain may stem from increased temporal context rather than the compression mechanism itself.
Authors: We appreciate this observation regarding fair comparison. The baselines were evaluated following the configurations reported in their original papers, which typically use a single template frame. ETCTrack is specifically designed to compress multiple historical template frames. In the revised manuscript, we will explicitly document the number of template frames employed by each baseline. We will also add results from re-evaluating the primary baselines (MixFormer and OSTrack) using an equivalent number of historical frames prior to compression. This will more clearly attribute performance differences to the compress-then-interact framework rather than temporal context alone. revision: partial
-
Referee: [§4.1] §4.1 (Adaptive Token Compressor): the description of the token-selection process relies on learned importance scores, yet no ablation isolates the contribution of the compressor versus the Hierarchical Interaction Encoder. Without a controlled experiment that replaces the compressor with random or uniform token subsampling while keeping the encoder fixed, it remains unclear whether the 0.4% accuracy drop is truly minimal or whether the compressor is discarding task-critical tokens that the encoder later compensates for.
Authors: We agree that isolating the contribution of the Adaptive Token Compressor is important. We will include a new ablation study in the revised manuscript in which the learned compressor is replaced by random and uniform subsampling while the Hierarchical Interaction Encoder remains fixed. This experiment will demonstrate that the learned importance scores are responsible for preserving discriminative tokens and achieving the minimal accuracy drop, rather than the encoder compensating for suboptimal selection. revision: yes
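The promised ablation baselines are simple to state in code. The sketch below shows what "random" and "uniform" subsampling could look like; both function names and the fixed keep count are assumptions for illustration, not the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_subsample(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Ablation baseline: keep a uniformly random subset of tokens."""
    idx = rng.choice(tokens.shape[0], size=keep, replace=False)
    return tokens[np.sort(idx)]

def uniform_subsample(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Ablation baseline: keep evenly spaced tokens."""
    idx = np.linspace(0, tokens.shape[0] - 1, num=keep).round().astype(int)
    return tokens[idx]

tokens = rng.standard_normal((256, 32))
print(random_subsample(tokens, 102).shape, uniform_subsample(tokens, 102).shape)
```

Swapping either function in for the learned compressor, with the encoder frozen, is exactly the control the referee asks for: if accuracy under these baselines drops well below the learned compressor's, the importance scores are doing real work.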
Circularity Check
No significant circularity in derivation chain
full rationale
The paper proposes ETCTrack as a compress-then-interact framework using an Adaptive Token Compressor to filter redundant template tokens and a Hierarchical Interaction Encoder for search feature interaction. All reported gains (60% token reduction, 21.4% MACs drop, 0.4% accuracy change) are stated as direct experimental outcomes on seven public benchmarks rather than quantities derived from the method's own parameters or equations. No load-bearing steps reduce by construction to inputs, self-citations, or fitted values renamed as predictions. The derivation remains empirical and self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- token compression ratio
axioms (1)
- domain assumption: Redundant visual tokens in historical templates can be identified and removed without loss of target-discriminative information
invented entities (2)
- Adaptive Token Compressor (no independent evidence)
- Hierarchical Interaction Encoder (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens... Mask-Guided Token Pruning & Merging Module... greedy cosine similarity matching"
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "Hierarchical Interaction Encoder... multi-stage interaction process... context-aware enrichment of the template, followed by unified feature learning, and a final template-guided refinement"
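The "greedy cosine similarity matching" quoted in the first passage can be sketched as repeatedly merging the most cosine-similar pair of tokens. This is a hedged illustration: the paper's Mask-Guided Token Pruning & Merging Module is not reproduced here, and averaging matched pairs is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

def greedy_cosine_merge(tokens: np.ndarray, n_merge: int) -> np.ndarray:
    """Greedily merge the most cosine-similar token pair, n_merge times.

    Illustrative only; each merge replaces the closest pair with their mean,
    so every step reduces the token count by one.
    """
    toks = list(tokens)
    for _ in range(n_merge):
        normed = np.stack([t / np.linalg.norm(t) for t in toks])
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)          # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (toks[i] + toks[j]) / 2        # average the closest pair
        toks = [t for k, t in enumerate(toks) if k not in (i, j)] + [merged]
    return np.stack(toks)

out = greedy_cosine_merge(rng.standard_normal((16, 8)), n_merge=6)
print(out.shape)  # (10, 8)
```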
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [2] Yifan Bai, Zeyang Zhao, Yihong Gong, and Xing Wei. ARTrackV2: Prompting autoregressive tracker where to look and how to describe. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19048–19057, 2024.
- [3] Luca Bertinetto, Jack Valmadre, Joao F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In Computer Vision – ECCV 2016 Workshops, pages 850–865. Springer, 2016.
- [4] Wenrui Cai, Qingjie Liu, and Yunhong Wang. HIPTrack: Visual tracking with historical prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19258–19267, 2024.
- [5] Wenrui Cai, Qingjie Liu, and Yunhong Wang. SPMTrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16871–16881, 2025.
- [6] Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, and Tao Chen. MADTP: Multimodal alignment-guided dynamic token pruning for accelerating vision-language transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15710–15719, 2024.
- [7] Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. SeqTrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [8] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8126–8135, 2021.
- [9] Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. MixFormer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13608–13618, 2022.
- [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [11] Heng Fan and Haibin Ling. Siamese cascaded region proposal networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7952–7961, 2019.
- [12] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. LaSOT: A high-quality benchmark for large-scale single object tracking. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- [13] Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit Harshit, Mingzhen Huang, Juehuan Liu, Yong Xu, Chunyuan Liao, Yuan Lin, and Haibin Ling. LaSOT: A high-quality large-scale single object tracking benchmark. International Journal of Computer Vision, 2020.
- [14] Zhihong Fu, Qingjie Liu, Zehua Fu, and Yunhong Wang. STMTrack: Template-free visual tracking with space-time memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13774–13783, 2021.
- [15] Jie Gao, Bineng Zhong, and Yan Chen. Robust tracking via learning model update with unsupervised anomaly detection philosophy. IEEE Transactions on Circuits and Systems for Video Technology, 33(5):2330–2341, 2022.
- [16] Jie Gao, Bineng Zhong, and Yan Chen. Unambiguous object tracking by exploiting target cues. In Proceedings of the 31st ACM International Conference on Multimedia, pages 1997–2005, 2023.
- [17] Mingzhe Guo, Weiping Tan, Wenyu Ran, Liping Jing, and Zhipeng Zhang. DreamTrack: Dreaming the future for multi-modal visual object tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7201–7210, 2025.
- [18] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
- [19] Kaijie He, Canlong Zhang, Sheng Xie, Zhixin Li, and Zhiwen Wang. Target-aware tracking with long-term context attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 773–780, 2023.
- [20] Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems, 37:50168–50188, 2024.
- [21] Xiantao Hu, Bineng Zhong, Qihua Liang, Shengping Zhang, Ning Li, Xianxian Li, and Rongrong Ji. Transformer tracking via frequency fusion. IEEE Transactions on Circuits and Systems for Video Technology, 34(2):1020–1031, 2023.
- [22] Xiantao Hu, Bineng Zhong, Qihua Liang, Shengping Zhang, Ning Li, and Xianxian Li. Toward modalities correlation for RGB-T tracking. IEEE Transactions on Circuits and Systems for Video Technology, 34(10):9102–9111, 2024.
- [23] Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, and Jian Yang. Exploiting multimodal spatial-temporal patterns for video object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3581–3589, 2025.
- [24] Xiantao Hu, Bineng Zhong, Qihua Liang, Liangtao Shi, Zhiyi Mo, Ying Tai, and Jian Yang. Adaptive perception for unified visual multi-modal object tracking. IEEE Transactions on Artificial Intelligence, 2025.
- [25] Lianghua Huang, Xin Zhao, and Kaiqi Huang. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1562–1577, 2021.
- [26] Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, et al. Token-efficient long video understanding for multimodal LLMs. arXiv preprint arXiv:2503.04130, 2025.
- [27] Ben Kang, Xin Chen, Simiao Lai, Yang Liu, Yi Liu, and Dong Wang. Exploring enhanced contextual information for video-level object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4194–4202, 2025.
- [28] Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 1125–1134, 2017.
- [30] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.
- [31] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.
- [32] Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. OtterHD: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219, 2023.
- [33] Hao Li, Yuhao Wang, Xiantao Hu, Wenning Hao, Pingping Zhang, Dong Wang, and Huchuan Lu. CADTrack: Learning contextual aggregation with deformable alignment for robust RGBT tracking. arXiv preprint arXiv:2511.17967, 2025.
- [34] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [35] Shiyi Liang, Yifan Bai, Yihong Gong, and Xing Wei. Autoregressive sequential pretraining for visual tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7254–7264, 2025.
- [36] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024.
- [37] Liting Lin, Heng Fan, Zhipeng Zhang, Yong Xu, and Haibin Ling. SwinTrack: A simple and strong baseline for transformer tracking. Advances in Neural Information Processing Systems, 35:16743–16754, 2022.
- [38] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context, pages 740–755. 2014.
- [39] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
- [40] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
- [41] Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, and Bo Zhao. Video-XL-Pro: Reconstructive token compression for extremely long video understanding. arXiv preprint arXiv:2503.18478, 2025.
- [42] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- [43] Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, and Hongtao Xie. Hybrid-level instruction injection for video token compression in multi-modal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8568–8578, 2025.
- [44] Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, and Huo Cao. InternVL-X: Advancing and accelerating InternVL series with efficient visual token compression. arXiv preprint arXiv:2503.21307, 2025.
- [45] Matthias Müller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild, pages 310–327. 2018.
- [46] Liang Peng, Junyuan Gao, Xinran Liu, Weihong Li, Shaohua Dong, Zhipeng Zhang, Heng Fan, and Libo Zhang. VastTrack: Vast category visual object tracking. Advances in Neural Information Processing Systems, 37:130797–130818, 2024.
- [47] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- [48] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388, 2024.
- [49] Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, and Xianxian Li. Explicit visual prompts for visual object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4838–4846, 2024.
- [50] Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8791–8800, 2022.
- [51] Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2321–2329, 2023.
- [52] Ran Tao, Efstratios Gavves, and Arnold W. M. Smeulders. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1420–1429, 2016.
- [53] Yunjie Tian, Lingxi Xie, Jihao Qiu, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- [54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [55] Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13763–13773, 2021.
- [56] Xing Wei, Yifan Bai, Yongchao Zheng, Dahu Shi, and Yihong Gong. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9697–9706, 2023.
- [57] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1834–1848, 2015.
- [58] Fei Xie, Lei Chu, Jiahao Li, Yan Lu, and Chao Ma. VideoTrack: Learning to track objects via video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22826–22835, 2023.
- [60] Jinxia Xie, Bineng Zhong, Qihua Liang, Ning Li, Zhiyi Mo, and Shuxiang Song. Robust tracking via mamba-based context-aware token learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8727–8735, 2025.
- [61] Chenlong Xu, Bineng Zhong, Qihua Liang, Yaozong Zheng, Guorong Li, and Shuxiang Song. Less is more: Token context-aware learning for object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8824–8832, 2025.
- [62] Chaocan Xue, Bineng Zhong, Qihua Liang, Yaozong Zheng, Ning Li, Yuanliang Xue, and Shuxiang Song. Similarity-guided layer-adaptive vision transformer for UAV tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6730–6740, 2025.
- [63] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
- [64] Dawei Yang, Jianfeng He, Yinchao Ma, Qianjin Yu, and Tianzhu Zhang. Foreground-background distribution modeling transformer for visual object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10117–10127, 2023.
- [65] Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In European Conference on Computer Vision, pages 341–357. Springer, 2022.
- [66] Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, and Liqiang Nie. Token-level correlation-guided compression for efficient multimodal document understanding. arXiv preprint arXiv:2407.14439, 2024.
- [67] Xiaosong Zhang, Yunjie Tian, Lingxi Xie, Wei Huang, Qi Dai, Qixiang Ye, and Qi Tian. HiViT: A simpler and more efficient design of hierarchical vision transformer. In The Eleventh International Conference on Learning Representations, 2023.
- [68] Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhenjun Tang, Rongrong Ji, and Xianxian Li. Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Transactions on Circuits and Systems for Video Technology, 33(4):1671–1683, 2022.
- [69] Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, and Xianxian Li. Toward unified token learning for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023.
- [70] Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. ODTrack: Online dense temporal token learning for visual tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7588–7596, 2024.
- [71] Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, and Shuxiang Song. Decoupled spatio-temporal consistency learning for self-supervised tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10635–10643, 2025.
- [72] Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, and Rongrong Ji. Towards universal modal tracking with online dense temporal token learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [73] Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, and Sheng Guo. FocusLLaVA: A coarse-to-fine approach for efficient and effective visual token compression. arXiv preprint arXiv:2411.14228, 2024.