arxiv: 2605.08329 · v1 · submitted 2026-05-08 · 💻 cs.CV · eess.IV

Recognition: 2 theorem links

· Lean Theorem

An Efficient Token Compression Framework for Visual Object Tracking

Weijing Wu , Qihua Liang , Bineng Zhong , Haiying Xia , Zhiyi Mo , Shuxiang Song

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:22 UTC · model grok-4.3

classification 💻 cs.CV eess.IV

keywords visual object trackingtoken compressiontransformer trackertemplate tokensefficient inferencespatio-temporal contextadaptive filtering

0 comments

The pith

Compressing historical template tokens lets visual trackers use more frames at lower compute cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to resolve the quadratic cost and performance drop that arise when Transformer trackers incorporate many historical template frames for richer context. It introduces a compress-then-interact pipeline that first prunes redundant tokens from those templates into a compact target representation, then lets the remaining tokens interact deeply with the current search region. The resulting system is claimed to deliver higher or equal accuracy on seven standard benchmarks while cutting multiply-accumulate operations by more than 20 percent. A sympathetic reader would care because most real-world tracking applications are constrained by power or latency, so any reliable way to retain context without paying the full token cost expands what is feasible on edge hardware.

Core claim

ETCTrack first runs an Adaptive Token Compressor that learns to discard redundant visual tokens from multiple past templates, yielding a small set of highly discriminative template tokens; these tokens then enter a Hierarchical Interaction Encoder that performs layered cross-attention with the search-frame features, producing refined search representations that support accurate target localization.

What carries the argument

Adaptive Token Compressor that dynamically filters redundant visual tokens from historical templates before deep interaction with search features.

If this is right

Trackers can safely increase the number of historical templates without a proportional rise in quadratic attention cost.
The refined search features produced after compression still support precise bounding-box regression.
Overall multiply-accumulate operations drop 21 percent on a 224-resolution backbone with only a 0.4 percent accuracy trade-off.
The same compression step can be inserted into other multi-template Transformer trackers that currently suffer token explosion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token-filtering idea may transfer to other long-sequence vision tasks such as video action recognition or multi-object tracking where historical frames also create token overload.
In latency-sensitive settings the reduced MAC count could translate directly into higher sustained frame rates on mobile or embedded devices.
Because the compressor is learned rather than hand-crafted, retraining it on domain-specific data could further tighten the accuracy-efficiency frontier.

Load-bearing premise

Redundant visual tokens can be removed from historical templates without discarding the information needed to distinguish the target from distractors or background.

What would settle it

Run the method and the best uncompressed baseline on a new benchmark containing frequent heavy occlusion and visually similar distractors; if the compressed version shows a success-rate drop larger than 2-3 percent while the baseline does not, the central claim fails.

Figures

Figures reproduced from arXiv: 2605.08329 by Bineng Zhong, Haiying Xia, Qihua Liang, Shuxiang Song, Weijing Wu, Zhiyi Mo.

**Figure 2.** Figure 2: (a) ETCTrack Framework Architecture. The process begins with our ATC module, which effectively compresses visual tokens from historical template frames to eliminate visual redundancy. These compressed tokens, along with search region tokens, are then fed into the Hierarchical Interaction Encoder for contextual feature interaction. Finally, the enhanced search features are sent to the Prediction Head to pre… view at source ↗

**Figure 3.** Figure 3: The Mask-Guided Token Pruning and Merging Module. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: The structure of the Hierarchical Interaction Block. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: AUC scores of different attributes on LaSOT [ [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: LaSOT AUC vs. template frames for our variants. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization of visual redundancy elimination. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined template tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. Refined search features ensure subsequent precise target localization. Experiments on seven benchmarks demonstrate that our method outperforms current state-of-the-art trackers. ETCTrack-B224 reduces the number of template tokens by 60%, leading to a 21.4% reduction in MACs with only a 0.4% drop in accuracy. The source code are available at https://github.com/PJD-WJ/ETCTrack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ETCTrack shows a learned compressor can cut template tokens 60% in multi-frame trackers for 21% MACs savings with tiny accuracy cost.

read the letter

The main thing here is that ETCTrack learns to compress historical template tokens in transformer trackers, trimming them by 60% and MACs by 21.4% while losing only 0.4% accuracy on the reported tests. They use an Adaptive Token Compressor to drop redundant tokens from past frames and keep a compact target rep, then feed those into a Hierarchical Interaction Encoder for deeper mixing with search features. This directly targets the quadratic cost and performance hit from stacking many templates, which is a real bottleneck for longer videos or lighter hardware. The results cover seven standard benchmarks and beat current trackers, and the code is released, which helps with checking the implementation. It extends existing token reduction work in vision transformers to the specific case of historical templates in tracking, with a clean compress-then-interact flow. The soft spot is that the compressor’s reliability rests mostly on overall scores; there is not much detail yet on when it might discard useful tokens during fast motion or heavy occlusion, or how the gains break down by component. The accuracy change is small on average, but per-sequence variance would clarify the limits. This paper is for people building or optimizing transformer trackers who need efficiency numbers they can actually use. Readers focused on practical ViT scaling for video tasks will get value from the concrete metrics and open code. It deserves a serious referee because the claims are testable on public data and the method is described enough to evaluate. I would send it for review.

Referee Report

2 major / 3 minor

Summary. The paper proposes ETCTrack, a compress-then-interact framework for Transformer-based visual object tracking. It introduces an Adaptive Token Compressor that dynamically filters redundant tokens from multiple historical template frames into a compact representation, followed by a Hierarchical Interaction Encoder that performs deep adaptive interactions between the compressed templates and search-region features. Experiments across seven benchmarks report that the method outperforms current state-of-the-art trackers; specifically, ETCTrack-B224 achieves a 60% reduction in template tokens, a 21.4% reduction in MACs, and only a 0.4% drop in accuracy relative to an uncompressed baseline.

Significance. If the reported efficiency-accuracy trade-off holds under rigorous controls, the work offers a practical route to scaling the number of historical templates in Transformer trackers without incurring quadratic cost or performance degradation. The learned compression approach, rather than handcrafted rules, could generalize to other token-heavy vision tasks. The public code release strengthens reproducibility.

major comments (2)

[§5, Table 2] §5 (Experiments), Table 2 and the main results paragraph: the claim that ETCTrack 'outperforms current state-of-the-art trackers' is difficult to evaluate because the paper does not state whether the listed baselines (e.g., MixFormer, OSTrack) were re-run with the same number of historical template frames that ETCTrack uses before compression. If baselines operate with fewer templates, part of the reported gain may stem from increased temporal context rather than the compression mechanism itself.
[§4.1] §4.1 (Adaptive Token Compressor): the description of the token-selection process relies on learned importance scores, yet no ablation isolates the contribution of the compressor versus the Hierarchical Interaction Encoder. Without a controlled experiment that replaces the compressor with random or uniform token subsampling while keeping the encoder fixed, it remains unclear whether the 0.4% accuracy drop is truly minimal or whether the compressor is discarding task-critical tokens that the encoder later compensates for.

minor comments (3)

[Abstract] Abstract: 'The source code are available' contains a subject-verb agreement error; should read 'The source code is available'.
[§3] §3 (Method overview): the notation for the compressed token set T' is introduced without an explicit equation relating it to the original token set T; adding a compact equation (e.g., T' = f_comp(T)) would improve readability.
[Figure 3] Figure 3 (qualitative results): the caption does not indicate whether the visualized attention maps are from the compressed or uncompressed model, making it hard to attribute the improved localization to the proposed modules.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with clarifications and commitments to revisions that strengthen the manuscript without misrepresenting our experimental setup.

read point-by-point responses

Referee: [§5, Table 2] §5 (Experiments), Table 2 and the main results paragraph: the claim that ETCTrack 'outperforms current state-of-the-art trackers' is difficult to evaluate because the paper does not state whether the listed baselines (e.g., MixFormer, OSTrack) were re-run with the same number of historical template frames that ETCTrack uses before compression. If baselines operate with fewer templates, part of the reported gain may stem from increased temporal context rather than the compression mechanism itself.

Authors: We appreciate this observation regarding fair comparison. The baselines were evaluated following the configurations reported in their original papers, which typically use a single template frame. ETCTrack is specifically designed to compress multiple historical template frames. In the revised manuscript, we will explicitly document the number of template frames employed by each baseline. We will also add results from re-evaluating the primary baselines (MixFormer and OSTrack) using an equivalent number of historical frames prior to compression. This will more clearly attribute performance differences to the compress-then-interact framework rather than temporal context alone. revision: partial
Referee: [§4.1] §4.1 (Adaptive Token Compressor): the description of the token-selection process relies on learned importance scores, yet no ablation isolates the contribution of the compressor versus the Hierarchical Interaction Encoder. Without a controlled experiment that replaces the compressor with random or uniform token subsampling while keeping the encoder fixed, it remains unclear whether the 0.4% accuracy drop is truly minimal or whether the compressor is discarding task-critical tokens that the encoder later compensates for.

Authors: We agree that isolating the contribution of the Adaptive Token Compressor is important. We will include a new ablation study in the revised manuscript in which the learned compressor is replaced by random and uniform subsampling while the Hierarchical Interaction Encoder remains fixed. This experiment will demonstrate that the learned importance scores are responsible for preserving discriminative tokens and achieving the minimal accuracy drop, rather than the encoder compensating for suboptimal selection. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes ETCTrack as a compress-then-interact framework using an Adaptive Token Compressor to filter redundant template tokens and a Hierarchical Interaction Encoder for search feature interaction. All reported gains (60% token reduction, 21.4% MACs drop, 0.4% accuracy change) are stated as direct experimental outcomes on seven public benchmarks rather than quantities derived from the method's own parameters or equations. No load-bearing steps reduce by construction to inputs, self-citations, or fitted values renamed as predictions. The derivation remains empirical and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim rests on the empirical effectiveness of two newly introduced learned modules whose internal hyperparameters and training details are not visible from the abstract alone.

free parameters (1)

token compression ratio
The reported 60% reduction is achieved by the compressor and is likely controlled by a tunable threshold or learned parameter.

axioms (1)

domain assumption Redundant visual tokens in historical templates can be identified and removed without loss of target-discriminative information
This premise underpins the design of the Adaptive Token Compressor.

invented entities (2)

Adaptive Token Compressor no independent evidence
purpose: Dynamically filter redundant template tokens into a compact representation
New module introduced by the paper; no independent evidence outside the reported experiments.
Hierarchical Interaction Encoder no independent evidence
purpose: Enable deep adaptive interaction between compressed templates and search features
New module introduced by the paper; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.0 · 5543 in / 1368 out tokens · 46314 ms · 2026-05-12T01:22:41.757751+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens... Mask-Guided Token Pruning & Merging Module... greedy cosine similarity matching
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Hierarchical Interaction Encoder... multi-stage interaction process... context-aware enrichment of the template, followed by unified feature learning, and a final template-guided refinement

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 1 internal anchor

[1]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

work page
[2]

Ar- trackv2: Prompting autoregressive tracker where to look and how to describe

Yifan Bai, Zeyang Zhao, Yihong Gong, and Xing Wei. Ar- trackv2: Prompting autoregressive tracker where to look and how to describe. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19048– 19057, 2024. 3, 6

work page 2024
[3]

Fully-convolutional siamese networks for object tracking

Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. InComputer vision–ECCV 2016 workshops: Amsterdam, the Netherlands, October 8- 10 and 15-16, 2016, proceedings, part II 14, pages 850–865. Springer, 2016. 1, 2, 6

work page 2016
[4]

Hiptrack: Visual tracking with historical prompts

Wenrui Cai, Qingjie Liu, and Yunhong Wang. Hiptrack: Visual tracking with historical prompts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19258–19267, 2024. 6

work page 2024
[5]

Spmtrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking

Wenrui Cai, Qingjie Liu, and Yunhong Wang. Spmtrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16871–16881, 2025. 3, 6

work page 2025
[6]

Madtp: Multi- modal alignment-guided dynamic token pruning for accel- erating vision-language transformer

Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yan- song Tang, Jiwen Lu, and Tao Chen. Madtp: Multi- modal alignment-guided dynamic token pruning for accel- erating vision-language transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15710–15719, 2024. 3

work page 2024
[7]

Seqtrack: Sequence to sequence learning for visual ob- ject tracking

Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. Seqtrack: Sequence to sequence learning for visual ob- ject tracking. 2

work page
[8]

Transformer tracking

Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8126–8135, 2021. 2, 6

work page 2021
[9]

Mixformer: End-to-end tracking with iterative mixed atten- tion

Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed atten- tion. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 13608–13618,

work page
[10]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2010
[11]

Siamese cascaded region pro- posal networks for real-time visual tracking

Heng Fan and Haibin Ling. Siamese cascaded region pro- posal networks for real-time visual tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7952–7961, 2019. 2

work page 2019
[12]

Lasot: A high-quality benchmark for large-scale single ob- ject tracking

Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single ob- ject tracking. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 5, 6, 7

work page 2019
[13]

Lasot: A high-quality large-scale single object track- ing benchmark.International Journal of Computer Vi- sion,International Journal of Computer Vision, 2020

Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit Harshit, Mingzhen Huang, Jue- huan Liu, Yong Xu, Chunyuan Liao, Yuan Lin, and Haibin Ling. Lasot: A high-quality large-scale single object track- ing benchmark.International Journal of Computer Vi- sion,International Journal of Computer Vision, 2020. 6, 7

work page 2020
[14]

Stmtrack: Template-free visual tracking with space-time memory networks

Zhihong Fu, Qingjie Liu, Zehua Fu, and Yunhong Wang. Stmtrack: Template-free visual tracking with space-time memory networks. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 13774–13783, 2021. 2

work page 2021
[15]

Robust tracking via learning model update with unsupervised anomaly detection philosophy.IEEE Transactions on Circuits and Systems for Video Technology, 33(5):2330–2341, 2022

Jie Gao, Bineng Zhong, and Yan Chen. Robust tracking via learning model update with unsupervised anomaly detection philosophy.IEEE Transactions on Circuits and Systems for Video Technology, 33(5):2330–2341, 2022. 2

work page 2022
[16]

Unambiguous object tracking by exploiting target cues

Jie Gao, Bineng Zhong, and Yan Chen. Unambiguous object tracking by exploiting target cues. InProceedings of the 31st ACM international conference on multimedia, pages 1997– 2005, 2023. 2

work page 1997
[17]

Dreamtrack: Dreaming the future for multi- modal visual object tracking

Mingzhe Guo, Weiping Tan, Wenyu Ran, Liping Jing, and Zhipeng Zhang. Dreamtrack: Dreaming the future for multi- modal visual object tracking. InProceedings of the Com- puter Vision and Pattern Recognition Conference, pages 7201–7210, 2025. 6

work page 2025
[18]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 8

work page 2022
[19]

Target-aware tracking with long-term context at- tention

Kaijie He, Canlong Zhang, Sheng Xie, Zhixin Li, and Zhi- wen Wang. Target-aware tracking with long-term context at- tention. InProceedings of the AAAI conference on artificial intelligence, pages 773–780, 2023. 2, 3

work page 2023
[20]

Matryoshka query transformer for large vision-language models.Advances in Neural Infor- mation Processing Systems, 37:50168–50188, 2024

Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. Matryoshka query transformer for large vision-language models.Advances in Neural Infor- mation Processing Systems, 37:50168–50188, 2024. 3

work page 2024
[21]

Transformer track- ing via frequency fusion.IEEE Transactions on Circuits and Systems for Video Technology, 34(2):1020–1031, 2023

Xiantao Hu, Bineng Zhong, Qihua Liang, Shengping Zhang, Ning Li, Xianxian Li, and Rongrong Ji. Transformer track- ing via frequency fusion.IEEE Transactions on Circuits and Systems for Video Technology, 34(2):1020–1031, 2023. 2

work page 2023
[22]

Toward modalities correlation for rgb-t tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(10):9102–9111, 2024

Xiantao Hu, Bineng Zhong, Qihua Liang, Shengping Zhang, Ning Li, and Xianxian Li. Toward modalities correlation for rgb-t tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(10):9102–9111, 2024

work page 2024
[23]

Exploiting multimodal spatial-temporal patterns for video object tracking

Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, and Jian Yang. Exploiting multimodal spatial-temporal patterns for video object tracking. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 3581–3589, 2025

work page 2025
[24]

Adaptive perception for unified visual multi-modal object tracking.IEEE Trans- actions on Artificial Intelligence, 2025

Xiantao Hu, Bineng Zhong, Qihua Liang, Liangtao Shi, Zhiyi Mo, Ying Tai, and Jian Yang. Adaptive perception for unified visual multi-modal object tracking.IEEE Trans- actions on Artificial Intelligence, 2025. 2

work page 2025
[25]

Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, page 1562–1577, 2021

Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, page 1562–1577, 2021. 6, 7

work page 2021
[26]

Token-efficient long video understanding for multimodal llms.arXiv preprint arXiv:2503.04130, 2025

Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, et al. Token-efficient long video understanding for multimodal llms.arXiv preprint arXiv:2503.04130, 2025. 2

work page arXiv 2025
[27]

Exploring enhanced contextual information for video-level object tracking

Ben Kang, Xin Chen, Simiao Lai, Yang Liu, Yi Liu, and Dong Wang. Exploring enhanced contextual information for video-level object tracking. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 4194–4202, 2025. 6

work page 2025
[28]

Need for speed: A bench- mark for higher frame rate object tracking

Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A bench- mark for higher frame rate object tracking. InProceedings of the IEEE international conference on computer vision, pages 1125–1134, 2017. 6, 7

work page 2017
[30]

High performance visual tracking with siamese region pro- posal network

Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region pro- posal network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8971–8980,

work page
[31]

Siamrpn++: Evolution of siamese vi- sual tracking with very deep networks

Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese vi- sual tracking with very deep networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4282–4291, 2019. 2, 6

work page 2019
[32]

Otterhd: A high-resolution multi-modality model

Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi- modality model.arXiv preprint arXiv:2311.04219, 2023. 3

work page arXiv 2023
[33]

Cadtrack: Learning contextual aggregation with deformable alignment for robust rgbt tracking.arXiv preprint arXiv:2511.17967, 2025

Hao Li, Yuhao Wang, Xiantao Hu, Wenning Hao, Pingping Zhang, Dong Wang, and Huchuan Lu. Cadtrack: Learning contextual aggregation with deformable alignment for robust rgbt tracking.arXiv preprint arXiv:2511.17967, 2025. 2

work page arXiv 2025
[34]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3, 4

work page 2023
[35]

Au- toregressive sequential pretraining for visual tracking

Shiyi Liang, Yifan Bai, Yihong Gong, and Xing Wei. Au- toregressive sequential pretraining for visual tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7254–7264, 2025. 6

work page 2025
[36]

Moe-llava: Mix- ture of experts for large vision-language models

Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. Moe-llava: Mixture of experts for large vision- language models.arXiv preprint arXiv:2401.15947, 2024. 3

work page arXiv 2024
[37]

Swintrack: A simple and strong baseline for trans- former tracking.Advances in Neural Information Processing Systems, 35:16743–16754, 2022

Liting Lin, Heng Fan, Zhipeng Zhang, Yong Xu, and Haibin Ling. Swintrack: A simple and strong baseline for trans- former tracking.Advances in Neural Information Processing Systems, 35:16743–16754, 2022. 2

work page 2022
[38]

Lawrence Zitnick.Microsoft COCO: Common Objects in Context, page 740–755

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C. Lawrence Zitnick.Microsoft COCO: Common Objects in Context, page 740–755. 2014. 6

work page 2014
[39]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In2017 IEEE International Conference on Computer Vision (ICCV),

work page
[40]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 3

work page 2024
[41]

Video-xl-pro: Reconstructive token compres- sion for extremely long video understanding.arXiv preprint arXiv:2503.18478, 2025

Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, and Bo Zhao. Video-xl-pro: Reconstructive token compres- sion for extremely long video understanding.arXiv preprint arXiv:2503.18478, 2025. 2

work page arXiv 2025
[42]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2

work page 2021
[43]

Hybrid-level instruction injection for video token com- pression in multi-modal large language models

Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, and Hongtao Xie. Hybrid-level instruction injection for video token com- pression in multi-modal large language models. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 8568–8578, 2025. 2

work page 2025
[44]

Internvl-x: Advanc- ing and accelerating internvl series with efficient visual token compression.arXiv preprint arXiv:2503.21307, 2025

Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jian- liang Zeng, Mao Shu, and Huo Cao. Internvl-x: Advanc- ing and accelerating internvl series with efficient visual token compression.arXiv preprint arXiv:2503.21307, 2025. 2

work page arXiv 2025
[45]

Matthias M ¨uller, Adel Bibi, Silvio Giancola, Salman Alsub- aihi, and Bernard Ghanem.TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild, page 310–327. 2018. 6, 7

work page 2018
[46]

Vast- track: Vast category visual object tracking.Advances in Neu- ral Information Processing Systems, 37:130797–130818,

Liang Peng, Junyuan Gao, Xinran Liu, Weihong Li, Shaohua Dong, Zhipeng Zhang, Heng Fan, and Libo Zhang. Vast- track: Vast category visual object tracking.Advances in Neu- ral Information Processing Systems, 37:130797–130818,

work page
[47]

Generalized in- tersection over union: A metric and a loss for bounding box regression

Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized in- tersection over union: A metric and a loss for bounding box regression. In2019 IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2019. 5

work page 2019
[48]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388, 2024

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388,

work page arXiv
[49]

Explicit visual prompts for visual object tracking

Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Sheng- ping Zhang, and Xianxian Li. Explicit visual prompts for visual object tracking. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 4838–4846, 2024. 3

work page 2024
[50]

Transformer tracking with cyclic shifting window attention

Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8791–8800,

work page
[51]

Compact transformer tracker with correlative masked modeling

Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. InProceedings of the AAAI conference on artificial intelligence, pages 2321–2329, 2023. 2

work page 2023
[52]

Siamese instance search for tracking

Ran Tao, Efstratios Gavves, and Arnold WM Smeulders. Siamese instance search for tracking. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 1420–1429, 2016. 1

work page 2016
[53]

Fast-itpn: Integrally pre- trained transformer pyramid network with token migration

Yunjie Tian, Lingxi Xie, Jihao Qiu, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Fast-itpn: Integrally pre- trained transformer pyramid network with token migration. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 2024. 2, 4, 5, 6, 8

work page 2024
[54]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2

work page 2017
[55]

Towards more flexible and accurate object tracking with natural language: Algo- rithms and benchmark

Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algo- rithms and benchmark. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13763–13773, 2021. 6, 7

work page 2021
[56]

Autoregressive visual tracking

Xing Wei, Yifan Bai, Yongchao Zheng, Dahu Shi, and Yi- hong Gong. Autoregressive visual tracking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 9697–9706, 2023. 3

work page 2023
[57]

Object track- ing benchmark.IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1834–1848, 2015

Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object track- ing benchmark.IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1834–1848, 2015. 6, 7

work page 2015
[58]

Video- track: Learning to track objects via video transformer

Fei Xie, Lei Chu, Jiahao Li, Yan Lu, and Chao Ma. Video- track: Learning to track objects via video transformer. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 22826–22835, 2023. 2

work page 2023
[59]

Video- track: Learning to track objects via video transformer

Fei Xie, Lei Chu, Jiahao Li, Yan Lu, and Chao Ma. Video- track: Learning to track objects via video transformer. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 22826–22835, 2023. 6

work page 2023
[60]

Robust tracking via mamba- based context-aware token learning

Jinxia Xie, Bineng Zhong, Qihua Liang, Ning Li, Zhiyi Mo, and Shuxiang Song. Robust tracking via mamba- based context-aware token learning. InProceedings of the AAAI Conference on Artificial Intelligence, pages 8727– 8735, 2025. 3, 6

work page 2025
[61]

Less is more: Token context-aware learning for object tracking

Chenlong Xu, Bineng Zhong, Qihua Liang, Yaozong Zheng, Guorong Li, and Shuxiang Song. Less is more: Token context-aware learning for object tracking. InProceedings of the AAAI Conference on Artificial Intelligence, pages 8824– 8832, 2025. 2, 4

work page 2025
[62]

Similarity- guided layer-adaptive vision transformer for uav tracking

Chaocan Xue, Bineng Zhong, Qihua Liang, Yaozong Zheng, Ning Li, Yuanliang Xue, and Shuxiang Song. Similarity- guided layer-adaptive vision transformer for uav tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6730–6740, 2025. 2

work page 2025
[63]

Learning spatio-temporal transformer for vi- sual tracking

Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for vi- sual tracking. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 2

work page 2021
[64]

Foreground-background distribution mod- eling transformer for visual object tracking

Dawei Yang, Jianfeng He, Yinchao Ma, Qianjin Yu, and Tianzhu Zhang. Foreground-background distribution mod- eling transformer for visual object tracking. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 10117–10127, 2023. 6

work page 2023
[65]

Joint feature learning and relation modeling for tracking: A one-stream framework

Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. InEuropean conference on computer vision, pages 341–357. Springer, 2022. 2, 5, 6

work page 2022
[66]

Token-level correlation-guided com- pression for efficient multimodal document understanding

Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, and Liqiang Nie. Token-level correlation-guided com- pression for efficient multimodal document understanding. arXiv preprint arXiv:2407.14439, 2024. 3

work page arXiv 2024
[67]

Hivit: A simpler and more efficient design of hierarchical vision transformer

Xiaosong Zhang, Yunjie Tian, Lingxi Xie, Wei Huang, Qi Dai, Qixiang Ye, and Qi Tian. Hivit: A simpler and more efficient design of hierarchical vision transformer. InThe Eleventh International Conference on Learning Representa- tions, 2023. 2

work page 2023
[68]

Leveraging local and global cues for visual tracking via parallel interaction network

Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhenjun Tang, Rongrong Ji, and Xianxian Li. Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Transactions on Circuits and Systems for Video Tech- nology, 33(4):1671–1683, 2022. 2

work page 2022
[69]

Toward unified token learning for vision-language tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023

Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, and Xianxian Li. Toward unified token learning for vision-language tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023. 2

work page 2023
[70]

Odtrack: Online dense temporal token learning for visual tracking

Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. Odtrack: Online dense temporal token learning for visual tracking. InProceed- ings of the AAAI conference on artificial intelligence, pages 7588–7596, 2024. 2, 3, 4, 6

work page 2024
[71]

Decoupled spatio-temporal consistency learning for self-supervised tracking

Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, and Shuxiang Song. Decoupled spatio-temporal consistency learning for self-supervised tracking. InProceedings of the AAAI Conference on Artificial Intelligence, pages 10635– 10643, 2025. 2

work page 2025
[72]

Towards universal modal tracking with online dense temporal token learning.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 2025

Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, and Rongrong Ji. Towards universal modal tracking with online dense temporal token learning.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 2025. 2

work page 2025
[73]

Focusllava: A coarse-to-fine approach for efficient and effective visual token compression.arXiv preprint arXiv:2411.14228, 2024

Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, and Sheng Guo. Focusllava: A coarse-to-fine approach for efficient and effective visual token compression.arXiv preprint arXiv:2411.14228, 2024. 2, 4

work page arXiv 2024