pith. machine review for the scientific record. sign in

arxiv: 2605.08329 · v1 · submitted 2026-05-08 · 💻 cs.CV · eess.IV

Recognition: 2 theorem links

· Lean Theorem

An Efficient Token Compression Framework for Visual Object Tracking

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:22 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords visual object trackingtoken compressiontransformer trackertemplate tokensefficient inferencespatio-temporal contextadaptive filtering
0
0 comments X

The pith

Compressing historical template tokens lets visual trackers use more frames at lower compute cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to resolve the quadratic cost and performance drop that arise when Transformer trackers incorporate many historical template frames for richer context. It introduces a compress-then-interact pipeline that first prunes redundant tokens from those templates into a compact target representation, then lets the remaining tokens interact deeply with the current search region. The resulting system is claimed to deliver higher or equal accuracy on seven standard benchmarks while cutting multiply-accumulate operations by more than 20 percent. A sympathetic reader would care because most real-world tracking applications are constrained by power or latency, so any reliable way to retain context without paying the full token cost expands what is feasible on edge hardware.

Core claim

ETCTrack first runs an Adaptive Token Compressor that learns to discard redundant visual tokens from multiple past templates, yielding a small set of highly discriminative template tokens; these tokens then enter a Hierarchical Interaction Encoder that performs layered cross-attention with the search-frame features, producing refined search representations that support accurate target localization.

What carries the argument

Adaptive Token Compressor that dynamically filters redundant visual tokens from historical templates before deep interaction with search features.

If this is right

  • Trackers can safely increase the number of historical templates without a proportional rise in quadratic attention cost.
  • The refined search features produced after compression still support precise bounding-box regression.
  • Overall multiply-accumulate operations drop 21 percent on a 224-resolution backbone with only a 0.4 percent accuracy trade-off.
  • The same compression step can be inserted into other multi-template Transformer trackers that currently suffer token explosion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-filtering idea may transfer to other long-sequence vision tasks such as video action recognition or multi-object tracking where historical frames also create token overload.
  • In latency-sensitive settings the reduced MAC count could translate directly into higher sustained frame rates on mobile or embedded devices.
  • Because the compressor is learned rather than hand-crafted, retraining it on domain-specific data could further tighten the accuracy-efficiency frontier.

Load-bearing premise

Redundant visual tokens can be removed from historical templates without discarding the information needed to distinguish the target from distractors or background.

What would settle it

Run the method and the best uncompressed baseline on a new benchmark containing frequent heavy occlusion and visually similar distractors; if the compressed version shows a success-rate drop larger than 2-3 percent while the baseline does not, the central claim fails.

Figures

Figures reproduced from arXiv: 2605.08329 by Bineng Zhong, Haiying Xia, Qihua Liang, Shuxiang Song, Weijing Wu, Zhiyi Mo.

Figure 1
Figure 1. Figure 1: (a) Comparison of AUC and MACs of recent SOTA [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) ETCTrack Framework Architecture. The process begins with our ATC module, which effectively compresses visual tokens from historical template frames to eliminate visual redundancy. These compressed tokens, along with search region tokens, are then fed into the Hierarchical Interaction Encoder for contextual feature interaction. Finally, the enhanced search features are sent to the Prediction Head to pre… view at source ↗
Figure 3
Figure 3. Figure 3: The Mask-Guided Token Pruning and Merging Module. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The structure of the Hierarchical Interaction Block. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: AUC scores of different attributes on LaSOT [ [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: LaSOT AUC vs. template frames for our variants. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of visual redundancy elimination. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined template tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. Refined search features ensure subsequent precise target localization. Experiments on seven benchmarks demonstrate that our method outperforms current state-of-the-art trackers. ETCTrack-B224 reduces the number of template tokens by 60%, leading to a 21.4% reduction in MACs with only a 0.4% drop in accuracy. The source code are available at https://github.com/PJD-WJ/ETCTrack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes ETCTrack, a compress-then-interact framework for Transformer-based visual object tracking. It introduces an Adaptive Token Compressor that dynamically filters redundant tokens from multiple historical template frames into a compact representation, followed by a Hierarchical Interaction Encoder that performs deep adaptive interactions between the compressed templates and search-region features. Experiments across seven benchmarks report that the method outperforms current state-of-the-art trackers; specifically, ETCTrack-B224 achieves a 60% reduction in template tokens, a 21.4% reduction in MACs, and only a 0.4% drop in accuracy relative to an uncompressed baseline.

Significance. If the reported efficiency-accuracy trade-off holds under rigorous controls, the work offers a practical route to scaling the number of historical templates in Transformer trackers without incurring quadratic cost or performance degradation. The learned compression approach, rather than handcrafted rules, could generalize to other token-heavy vision tasks. The public code release strengthens reproducibility.

major comments (2)
  1. [§5, Table 2] §5 (Experiments), Table 2 and the main results paragraph: the claim that ETCTrack 'outperforms current state-of-the-art trackers' is difficult to evaluate because the paper does not state whether the listed baselines (e.g., MixFormer, OSTrack) were re-run with the same number of historical template frames that ETCTrack uses before compression. If baselines operate with fewer templates, part of the reported gain may stem from increased temporal context rather than the compression mechanism itself.
  2. [§4.1] §4.1 (Adaptive Token Compressor): the description of the token-selection process relies on learned importance scores, yet no ablation isolates the contribution of the compressor versus the Hierarchical Interaction Encoder. Without a controlled experiment that replaces the compressor with random or uniform token subsampling while keeping the encoder fixed, it remains unclear whether the 0.4% accuracy drop is truly minimal or whether the compressor is discarding task-critical tokens that the encoder later compensates for.
minor comments (3)
  1. [Abstract] Abstract: 'The source code are available' contains a subject-verb agreement error; should read 'The source code is available'.
  2. [§3] §3 (Method overview): the notation for the compressed token set T' is introduced without an explicit equation relating it to the original token set T; adding a compact equation (e.g., T' = f_comp(T)) would improve readability.
  3. [Figure 3] Figure 3 (qualitative results): the caption does not indicate whether the visualized attention maps are from the compressed or uncompressed model, making it hard to attribute the improved localization to the proposed modules.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with clarifications and commitments to revisions that strengthen the manuscript without misrepresenting our experimental setup.

read point-by-point responses
  1. Referee: [§5, Table 2] §5 (Experiments), Table 2 and the main results paragraph: the claim that ETCTrack 'outperforms current state-of-the-art trackers' is difficult to evaluate because the paper does not state whether the listed baselines (e.g., MixFormer, OSTrack) were re-run with the same number of historical template frames that ETCTrack uses before compression. If baselines operate with fewer templates, part of the reported gain may stem from increased temporal context rather than the compression mechanism itself.

    Authors: We appreciate this observation regarding fair comparison. The baselines were evaluated following the configurations reported in their original papers, which typically use a single template frame. ETCTrack is specifically designed to compress multiple historical template frames. In the revised manuscript, we will explicitly document the number of template frames employed by each baseline. We will also add results from re-evaluating the primary baselines (MixFormer and OSTrack) using an equivalent number of historical frames prior to compression. This will more clearly attribute performance differences to the compress-then-interact framework rather than temporal context alone. revision: partial

  2. Referee: [§4.1] §4.1 (Adaptive Token Compressor): the description of the token-selection process relies on learned importance scores, yet no ablation isolates the contribution of the compressor versus the Hierarchical Interaction Encoder. Without a controlled experiment that replaces the compressor with random or uniform token subsampling while keeping the encoder fixed, it remains unclear whether the 0.4% accuracy drop is truly minimal or whether the compressor is discarding task-critical tokens that the encoder later compensates for.

    Authors: We agree that isolating the contribution of the Adaptive Token Compressor is important. We will include a new ablation study in the revised manuscript in which the learned compressor is replaced by random and uniform subsampling while the Hierarchical Interaction Encoder remains fixed. This experiment will demonstrate that the learned importance scores are responsible for preserving discriminative tokens and achieving the minimal accuracy drop, rather than the encoder compensating for suboptimal selection. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes ETCTrack as a compress-then-interact framework using an Adaptive Token Compressor to filter redundant template tokens and a Hierarchical Interaction Encoder for search feature interaction. All reported gains (60% token reduction, 21.4% MACs drop, 0.4% accuracy change) are stated as direct experimental outcomes on seven public benchmarks rather than quantities derived from the method's own parameters or equations. No load-bearing steps reduce by construction to inputs, self-citations, or fitted values renamed as predictions. The derivation remains empirical and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim rests on the empirical effectiveness of two newly introduced learned modules whose internal hyperparameters and training details are not visible from the abstract alone.

free parameters (1)
  • token compression ratio
    The reported 60% reduction is achieved by the compressor and is likely controlled by a tunable threshold or learned parameter.
axioms (1)
  • domain assumption Redundant visual tokens in historical templates can be identified and removed without loss of target-discriminative information
    This premise underpins the design of the Adaptive Token Compressor.
invented entities (2)
  • Adaptive Token Compressor no independent evidence
    purpose: Dynamically filter redundant template tokens into a compact representation
    New module introduced by the paper; no independent evidence outside the reported experiments.
  • Hierarchical Interaction Encoder no independent evidence
    purpose: Enable deep adaptive interaction between compressed templates and search features
    New module introduced by the paper; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.0 · 5543 in / 1368 out tokens · 46314 ms · 2026-05-12T01:22:41.757751+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 1 internal anchor

  1. [1]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  2. [2]

    Ar- trackv2: Prompting autoregressive tracker where to look and how to describe

    Yifan Bai, Zeyang Zhao, Yihong Gong, and Xing Wei. Ar- trackv2: Prompting autoregressive tracker where to look and how to describe. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19048– 19057, 2024. 3, 6

  3. [3]

    Fully-convolutional siamese networks for object tracking

    Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. InComputer vision–ECCV 2016 workshops: Amsterdam, the Netherlands, October 8- 10 and 15-16, 2016, proceedings, part II 14, pages 850–865. Springer, 2016. 1, 2, 6

  4. [4]

    Hiptrack: Visual tracking with historical prompts

    Wenrui Cai, Qingjie Liu, and Yunhong Wang. Hiptrack: Visual tracking with historical prompts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19258–19267, 2024. 6

  5. [5]

    Spmtrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking

    Wenrui Cai, Qingjie Liu, and Yunhong Wang. Spmtrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16871–16881, 2025. 3, 6

  6. [6]

    Madtp: Multi- modal alignment-guided dynamic token pruning for accel- erating vision-language transformer

    Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yan- song Tang, Jiwen Lu, and Tao Chen. Madtp: Multi- modal alignment-guided dynamic token pruning for accel- erating vision-language transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15710–15719, 2024. 3

  7. [7]

    Seqtrack: Sequence to sequence learning for visual ob- ject tracking

    Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. Seqtrack: Sequence to sequence learning for visual ob- ject tracking. 2

  8. [8]

    Transformer tracking

    Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8126–8135, 2021. 2, 6

  9. [9]

    Mixformer: End-to-end tracking with iterative mixed atten- tion

    Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed atten- tion. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 13608–13618,

  10. [10]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2, 4

  11. [11]

    Siamese cascaded region pro- posal networks for real-time visual tracking

    Heng Fan and Haibin Ling. Siamese cascaded region pro- posal networks for real-time visual tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7952–7961, 2019. 2

  12. [12]

    Lasot: A high-quality benchmark for large-scale single ob- ject tracking

    Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single ob- ject tracking. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 5, 6, 7

  13. [13]

    Lasot: A high-quality large-scale single object track- ing benchmark.International Journal of Computer Vi- sion,International Journal of Computer Vision, 2020

    Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit Harshit, Mingzhen Huang, Jue- huan Liu, Yong Xu, Chunyuan Liao, Yuan Lin, and Haibin Ling. Lasot: A high-quality large-scale single object track- ing benchmark.International Journal of Computer Vi- sion,International Journal of Computer Vision, 2020. 6, 7

  14. [14]

    Stmtrack: Template-free visual tracking with space-time memory networks

    Zhihong Fu, Qingjie Liu, Zehua Fu, and Yunhong Wang. Stmtrack: Template-free visual tracking with space-time memory networks. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 13774–13783, 2021. 2

  15. [15]

    Robust tracking via learning model update with unsupervised anomaly detection philosophy.IEEE Transactions on Circuits and Systems for Video Technology, 33(5):2330–2341, 2022

    Jie Gao, Bineng Zhong, and Yan Chen. Robust tracking via learning model update with unsupervised anomaly detection philosophy.IEEE Transactions on Circuits and Systems for Video Technology, 33(5):2330–2341, 2022. 2

  16. [16]

    Unambiguous object tracking by exploiting target cues

    Jie Gao, Bineng Zhong, and Yan Chen. Unambiguous object tracking by exploiting target cues. InProceedings of the 31st ACM international conference on multimedia, pages 1997– 2005, 2023. 2

  17. [17]

    Dreamtrack: Dreaming the future for multi- modal visual object tracking

    Mingzhe Guo, Weiping Tan, Wenyu Ran, Liping Jing, and Zhipeng Zhang. Dreamtrack: Dreaming the future for multi- modal visual object tracking. InProceedings of the Com- puter Vision and Pattern Recognition Conference, pages 7201–7210, 2025. 6

  18. [18]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 8

  19. [19]

    Target-aware tracking with long-term context at- tention

    Kaijie He, Canlong Zhang, Sheng Xie, Zhixin Li, and Zhi- wen Wang. Target-aware tracking with long-term context at- tention. InProceedings of the AAAI conference on artificial intelligence, pages 773–780, 2023. 2, 3

  20. [20]

    Matryoshka query transformer for large vision-language models.Advances in Neural Infor- mation Processing Systems, 37:50168–50188, 2024

    Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. Matryoshka query transformer for large vision-language models.Advances in Neural Infor- mation Processing Systems, 37:50168–50188, 2024. 3

  21. [21]

    Transformer track- ing via frequency fusion.IEEE Transactions on Circuits and Systems for Video Technology, 34(2):1020–1031, 2023

    Xiantao Hu, Bineng Zhong, Qihua Liang, Shengping Zhang, Ning Li, Xianxian Li, and Rongrong Ji. Transformer track- ing via frequency fusion.IEEE Transactions on Circuits and Systems for Video Technology, 34(2):1020–1031, 2023. 2

  22. [22]

    Toward modalities correlation for rgb-t tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(10):9102–9111, 2024

    Xiantao Hu, Bineng Zhong, Qihua Liang, Shengping Zhang, Ning Li, and Xianxian Li. Toward modalities correlation for rgb-t tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(10):9102–9111, 2024

  23. [23]

    Exploiting multimodal spatial-temporal patterns for video object tracking

    Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, and Jian Yang. Exploiting multimodal spatial-temporal patterns for video object tracking. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 3581–3589, 2025

  24. [24]

    Adaptive perception for unified visual multi-modal object tracking.IEEE Trans- actions on Artificial Intelligence, 2025

    Xiantao Hu, Bineng Zhong, Qihua Liang, Liangtao Shi, Zhiyi Mo, Ying Tai, and Jian Yang. Adaptive perception for unified visual multi-modal object tracking.IEEE Trans- actions on Artificial Intelligence, 2025. 2

  25. [25]

    Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, page 1562–1577, 2021

    Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, page 1562–1577, 2021. 6, 7

  26. [26]

    Token-efficient long video understanding for multimodal llms.arXiv preprint arXiv:2503.04130, 2025

    Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, et al. Token-efficient long video understanding for multimodal llms.arXiv preprint arXiv:2503.04130, 2025. 2

  27. [27]

    Exploring enhanced contextual information for video-level object tracking

    Ben Kang, Xin Chen, Simiao Lai, Yang Liu, Yi Liu, and Dong Wang. Exploring enhanced contextual information for video-level object tracking. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 4194–4202, 2025. 6

  28. [28]

    Need for speed: A bench- mark for higher frame rate object tracking

    Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A bench- mark for higher frame rate object tracking. InProceedings of the IEEE international conference on computer vision, pages 1125–1134, 2017. 6, 7

  29. [30]

    High performance visual tracking with siamese region pro- posal network

    Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region pro- posal network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8971–8980,

  30. [31]

    Siamrpn++: Evolution of siamese vi- sual tracking with very deep networks

    Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese vi- sual tracking with very deep networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4282–4291, 2019. 2, 6

  31. [32]

    Otterhd: A high-resolution multi-modality model

    Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi- modality model.arXiv preprint arXiv:2311.04219, 2023. 3

  32. [33]

    Cadtrack: Learning contextual aggregation with deformable alignment for robust rgbt tracking.arXiv preprint arXiv:2511.17967, 2025

    Hao Li, Yuhao Wang, Xiantao Hu, Wenning Hao, Pingping Zhang, Dong Wang, and Huchuan Lu. Cadtrack: Learning contextual aggregation with deformable alignment for robust rgbt tracking.arXiv preprint arXiv:2511.17967, 2025. 2

  33. [34]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3, 4

  34. [35]

    Au- toregressive sequential pretraining for visual tracking

    Shiyi Liang, Yifan Bai, Yihong Gong, and Xing Wei. Au- toregressive sequential pretraining for visual tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7254–7264, 2025. 6

  35. [36]

    Moe-llava: Mix- ture of experts for large vision-language models

    Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. Moe-llava: Mixture of experts for large vision- language models.arXiv preprint arXiv:2401.15947, 2024. 3

  36. [37]

    Swintrack: A simple and strong baseline for trans- former tracking.Advances in Neural Information Processing Systems, 35:16743–16754, 2022

    Liting Lin, Heng Fan, Zhipeng Zhang, Yong Xu, and Haibin Ling. Swintrack: A simple and strong baseline for trans- former tracking.Advances in Neural Information Processing Systems, 35:16743–16754, 2022. 2

  37. [38]

    Lawrence Zitnick.Microsoft COCO: Common Objects in Context, page 740–755

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C. Lawrence Zitnick.Microsoft COCO: Common Objects in Context, page 740–755. 2014. 6

  38. [39]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In2017 IEEE International Conference on Computer Vision (ICCV),

  39. [40]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 3

  40. [41]

    Video-xl-pro: Reconstructive token compres- sion for extremely long video understanding.arXiv preprint arXiv:2503.18478, 2025

    Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, and Bo Zhao. Video-xl-pro: Reconstructive token compres- sion for extremely long video understanding.arXiv preprint arXiv:2503.18478, 2025. 2

  41. [42]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2

  42. [43]

    Hybrid-level instruction injection for video token com- pression in multi-modal large language models

    Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, and Hongtao Xie. Hybrid-level instruction injection for video token com- pression in multi-modal large language models. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 8568–8578, 2025. 2

  43. [44]

    Internvl-x: Advanc- ing and accelerating internvl series with efficient visual token compression.arXiv preprint arXiv:2503.21307, 2025

    Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jian- liang Zeng, Mao Shu, and Huo Cao. Internvl-x: Advanc- ing and accelerating internvl series with efficient visual token compression.arXiv preprint arXiv:2503.21307, 2025. 2

  44. [45]

    Matthias M ¨uller, Adel Bibi, Silvio Giancola, Salman Alsub- aihi, and Bernard Ghanem.TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild, page 310–327. 2018. 6, 7

  45. [46]

    Vast- track: Vast category visual object tracking.Advances in Neu- ral Information Processing Systems, 37:130797–130818,

    Liang Peng, Junyuan Gao, Xinran Liu, Weihong Li, Shaohua Dong, Zhipeng Zhang, Heng Fan, and Libo Zhang. Vast- track: Vast category visual object tracking.Advances in Neu- ral Information Processing Systems, 37:130797–130818,

  46. [47]

    Generalized in- tersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized in- tersection over union: A metric and a loss for bounding box regression. In2019 IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2019. 5

  47. [48]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388, 2024

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388,

  48. [49]

    Explicit visual prompts for visual object tracking

    Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Sheng- ping Zhang, and Xianxian Li. Explicit visual prompts for visual object tracking. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 4838–4846, 2024. 3

  49. [50]

    Transformer tracking with cyclic shifting window attention

    Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8791–8800,

  50. [51]

    Compact transformer tracker with correlative masked modeling

    Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. InProceedings of the AAAI conference on artificial intelligence, pages 2321–2329, 2023. 2

  51. [52]

    Siamese instance search for tracking

    Ran Tao, Efstratios Gavves, and Arnold WM Smeulders. Siamese instance search for tracking. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 1420–1429, 2016. 1

  52. [53]

    Fast-itpn: Integrally pre- trained transformer pyramid network with token migration

    Yunjie Tian, Lingxi Xie, Jihao Qiu, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Fast-itpn: Integrally pre- trained transformer pyramid network with token migration. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 2024. 2, 4, 5, 6, 8

  53. [54]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2

  54. [55]

    Towards more flexible and accurate object tracking with natural language: Algo- rithms and benchmark

    Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algo- rithms and benchmark. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13763–13773, 2021. 6, 7

  55. [56]

    Autoregressive visual tracking

    Xing Wei, Yifan Bai, Yongchao Zheng, Dahu Shi, and Yi- hong Gong. Autoregressive visual tracking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 9697–9706, 2023. 3

  56. [57]

    Object track- ing benchmark.IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1834–1848, 2015

    Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object track- ing benchmark.IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1834–1848, 2015. 6, 7

  57. [58]

    Video- track: Learning to track objects via video transformer

    Fei Xie, Lei Chu, Jiahao Li, Yan Lu, and Chao Ma. Video- track: Learning to track objects via video transformer. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 22826–22835, 2023. 2

  58. [59]

    Video- track: Learning to track objects via video transformer

    Fei Xie, Lei Chu, Jiahao Li, Yan Lu, and Chao Ma. Video- track: Learning to track objects via video transformer. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 22826–22835, 2023. 6

  59. [60]

    Robust tracking via mamba- based context-aware token learning

    Jinxia Xie, Bineng Zhong, Qihua Liang, Ning Li, Zhiyi Mo, and Shuxiang Song. Robust tracking via mamba- based context-aware token learning. InProceedings of the AAAI Conference on Artificial Intelligence, pages 8727– 8735, 2025. 3, 6

  60. [61]

    Less is more: Token context-aware learning for object tracking

    Chenlong Xu, Bineng Zhong, Qihua Liang, Yaozong Zheng, Guorong Li, and Shuxiang Song. Less is more: Token context-aware learning for object tracking. InProceedings of the AAAI Conference on Artificial Intelligence, pages 8824– 8832, 2025. 2, 4

  61. [62]

    Similarity- guided layer-adaptive vision transformer for uav tracking

    Chaocan Xue, Bineng Zhong, Qihua Liang, Yaozong Zheng, Ning Li, Yuanliang Xue, and Shuxiang Song. Similarity- guided layer-adaptive vision transformer for uav tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6730–6740, 2025. 2

  62. [63]

    Learning spatio-temporal transformer for vi- sual tracking

    Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for vi- sual tracking. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 2

  63. [64]

    Foreground-background distribution mod- eling transformer for visual object tracking

    Dawei Yang, Jianfeng He, Yinchao Ma, Qianjin Yu, and Tianzhu Zhang. Foreground-background distribution mod- eling transformer for visual object tracking. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 10117–10127, 2023. 6

  64. [65]

    Joint feature learning and relation modeling for tracking: A one-stream framework

    Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. InEuropean conference on computer vision, pages 341–357. Springer, 2022. 2, 5, 6

  65. [66]

    Token-level correlation-guided com- pression for efficient multimodal document understanding

    Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, and Liqiang Nie. Token-level correlation-guided com- pression for efficient multimodal document understanding. arXiv preprint arXiv:2407.14439, 2024. 3

  66. [67]

    Hivit: A simpler and more efficient design of hierarchical vision transformer

    Xiaosong Zhang, Yunjie Tian, Lingxi Xie, Wei Huang, Qi Dai, Qixiang Ye, and Qi Tian. Hivit: A simpler and more efficient design of hierarchical vision transformer. InThe Eleventh International Conference on Learning Representa- tions, 2023. 2

  67. [68]

    Leveraging local and global cues for visual tracking via parallel interaction network

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhenjun Tang, Rongrong Ji, and Xianxian Li. Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Transactions on Circuits and Systems for Video Tech- nology, 33(4):1671–1683, 2022. 2

  68. [69]

    Toward unified token learning for vision-language tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, and Xianxian Li. Toward unified token learning for vision-language tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023. 2

  69. [70]

    Odtrack: Online dense temporal token learning for visual tracking

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. Odtrack: Online dense temporal token learning for visual tracking. InProceed- ings of the AAAI conference on artificial intelligence, pages 7588–7596, 2024. 2, 3, 4, 6

  70. [71]

    Decoupled spatio-temporal consistency learning for self-supervised tracking

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, and Shuxiang Song. Decoupled spatio-temporal consistency learning for self-supervised tracking. InProceedings of the AAAI Conference on Artificial Intelligence, pages 10635– 10643, 2025. 2

  71. [72]

    Towards universal modal tracking with online dense temporal token learning.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 2025

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, and Rongrong Ji. Towards universal modal tracking with online dense temporal token learning.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 2025. 2

  72. [73]

    Focusllava: A coarse-to-fine approach for efficient and effective visual token compression.arXiv preprint arXiv:2411.14228, 2024

    Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, and Sheng Guo. Focusllava: A coarse-to-fine approach for efficient and effective visual token compression.arXiv preprint arXiv:2411.14228, 2024. 2, 4