SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker
Pith reviewed 2026-05-10 15:26 UTC · model grok-4.3
The pith
SEATrack aligns cross-modal attention maps and uses hierarchical experts to improve the performance-efficiency balance in multimodal object tracking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SEATrack addresses the performance-efficiency dilemma in PEFT-based multimodal tracking by first aligning cross-modal matching responses and then applying efficient global fusion. Existing two-stream methods produce conflicting attention maps because of modality-specific biases; AMG-LoRA integrates Low-Rank Adaptation with Adaptive Mutual Guidance to refine and align these maps dynamically. A Hierarchical Mixture of Experts then replaces conventional local fusion to enable global relation modeling while balancing expressiveness and computation. Together the two modules deliver measurable gains over prior state-of-the-art methods across RGB-T, RGB-D, and RGB-E tracking tasks.
What carries the argument
AMG-LoRA, which merges Low-Rank Adaptation with Adaptive Mutual Guidance to align attention maps across modalities, together with Hierarchical Mixture of Experts (HMoE) that performs efficient global cross-modal fusion.
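The Low-Rank Adaptation half of AMG-LoRA can be illustrated with a minimal sketch. It shows only the generic LoRA update (a frozen weight plus a scaled low-rank correction); the Adaptive Mutual Guidance step is not specified on this page and is not modeled here. The names `matmul`, `lora_forward`, `down`, `up`, and `alpha`, and all numeric values, are illustrative assumptions rather than anything taken from the paper or its code.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_forward(x, w_frozen, down, up, alpha):
    """Return x @ W + (alpha / r) * (x @ down) @ up.

    x: (n, d_in), w_frozen: (d_in, d_out), down: (d_in, r), up: (r, d_out).
    Only `down` and `up` would be trained; `w_frozen` stays fixed.
    """
    rank = len(up)  # number of rows in the up-projection is the rank r
    scale = alpha / rank
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, down), up)
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


# Hypothetical toy values: d_in = 3, d_out = 2, rank r = 1.
x = [[1.0, 2.0, 3.0]]
w_frozen = [[0.5, -1.0], [0.0, 2.0], [1.0, 1.0]]
down = [[0.1], [0.2], [0.3]]
up_zero = [[0.0, 0.0]]      # zero initialization: the adapter starts as a no-op
up_trained = [[1.0, -1.0]]  # stand-in for values learned during fine-tuning

print(lora_forward(x, w_frozen, down, up_zero, alpha=2.0))
print(lora_forward(x, w_frozen, down, up_trained, alpha=2.0))
```

With the up-projection initialized to zero, the adapted layer reproduces the frozen layer exactly, which is why LoRA fine-tuning starts from the pretrained behavior.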
If this is right
- Cross-modal attention alignment produces more effective joint representations than standard two-stream fusion.
- Hierarchical Mixture of Experts enables global relation modeling at lower computational cost than dense fusion layers.
- The same modules improve results on three distinct multimodal tracking settings without raising parameter counts.
- Parameter-efficient fine-tuning can retain its efficiency promise when attention conflicts are explicitly addressed.
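The efficiency argument behind these points can be made concrete with a small arithmetic sketch comparing the trainable parameters of a fully fine-tuned linear layer against a LoRA adapter on the same layer. The layer width of 768 and the rank of 8 are assumed for illustration only; this page does not state SEATrack's actual dimensions or ranks, and the function names are hypothetical.

```python
def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole d_in x d_out matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights for a rank-r adapter: a d_in x r and an r x d_out matrix."""
    return rank * (d_in + d_out)


# Assumed, illustrative sizes (not taken from the paper).
d_in = d_out = 768
rank = 8

full = full_finetune_params(d_in, d_out)
low_rank = lora_params(d_in, d_out, rank)
print(f"full fine-tuning: {full} trainable weights")
print(f"rank-{rank} adapter: {low_rank} trainable weights")
print(f"adapter / full = {low_rank / full:.4f}")
```

Under these assumed sizes the adapter trains 12,288 weights against 589,824 for the full matrix, a ratio of 1/48. Any extra fusion module adds to that budget, which is the trade-off the paper says it is trying to control.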
Where Pith is reading between the lines
- The same attention-alignment idea could be tested in other two-stream vision tasks such as multimodal segmentation or depth estimation.
- If the method truly needs little hyperparameter tuning, it may reduce the engineering effort required to adapt trackers to new sensor pairs.
- Attention-map conflict may be a general symptom in any dual-encoder architecture that processes paired but non-identical inputs.
Load-bearing premise
That conflicting attention maps from modality biases are the main obstacle in current two-stream trackers and that AMG-LoRA plus HMoE will correct them without creating new accuracy or speed trade-offs.
What would settle it
A side-by-side comparison on a standard RGB-T benchmark that measures attention-map similarity between modalities with and without the Adaptive Mutual Guidance component. If similarity is no higher with the component and overall tracking scores stay the same or drop, the alignment claim would be falsified.
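One way to quantify attention-map similarity for such a test is sketched below, using cosine similarity and a symmetric KL divergence between two flattened attention maps. This is an illustrative measurement under stated assumptions, not the paper's evaluation protocol: the function names, the smoothing constant `eps`, and the toy maps are all hypothetical.

```python
import math


def normalize(attn, eps=1e-12):
    """Turn non-negative attention scores into a smoothed probability distribution."""
    total = sum(attn) + eps * len(attn)
    return [(a + eps) / total for a in attn]


def cosine_similarity(p, q):
    """Cosine of the angle between two flattened attention maps."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p) for two normalized maps."""
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


# Hypothetical flattened attention maps over four search-region tokens.
rgb_attn = normalize([0.70, 0.10, 0.10, 0.10])
thermal_conflicting = normalize([0.10, 0.10, 0.10, 0.70])
thermal_aligned = normalize([0.65, 0.15, 0.10, 0.10])

for name, other in [("conflicting", thermal_conflicting), ("aligned", thermal_aligned)]:
    print(name, cosine_similarity(rgb_attn, other), symmetric_kl(rgb_attn, other))
```

Higher cosine similarity and lower symmetric KL indicate closer agreement between the two modalities. Averaging either score over a benchmark, with and without the guidance component, would give the comparison described above.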
Original abstract
Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack advances notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at https://github.com/AutoLab-SAI-SJTU/SEATrack.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SEATrack, a two-stream multimodal tracker for RGB-T, RGB-D, and RGB-E tasks. It identifies modality-specific biases in existing methods as causing conflicting matching attention maps that hinder joint representation learning. To address this, it proposes AMG-LoRA, which combines Low-Rank Adaptation with Adaptive Mutual Guidance for dynamic cross-modal attention alignment, and a Hierarchical Mixture of Experts (HMoE) module for efficient global relation modeling in fusion. The work claims these innovations yield notable progress over state-of-the-art methods in balancing tracking performance with parameter efficiency.
Significance. If the central claims hold, the paper offers a targeted solution to the performance-efficiency trade-off in parameter-efficient fine-tuning for multimodal tracking by emphasizing attention-map alignment rather than capacity scaling. The availability of code is a positive for reproducibility. The focus on an underexplored aspect of cross-modal consistency could influence future designs in efficient vision-language or sensor-fusion trackers.
major comments (2)
- [Abstract] The observation that modality-specific biases produce conflicting matching attention maps is presented as the primary motivation and bottleneck, yet no quantitative metric (such as inter-modal attention cosine similarity, KL divergence between maps, or overlap statistics) is supplied to establish the severity of the conflict in baselines or its correlation with tracking accuracy.
- [Abstract] The claim that AMG-LoRA plus HMoE resolves the alignment issue without introducing new trade-offs rests on high-level description; no ablation results, attention-map visualizations with before/after comparisons, or efficiency breakdowns (e.g., FLOPs or parameter counts per component) are referenced to demonstrate causality or the absence of hidden costs.
minor comments (1)
- [Abstract] The abstract states that code is available, but the repository URL appears only as an embedded hyperlink and no access instructions are given.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the abstract with additional references to supporting evidence from the main text.
- Referee: [Abstract] The observation that modality-specific biases produce conflicting matching attention maps is presented as the primary motivation and bottleneck, yet no quantitative metric (such as inter-modal attention cosine similarity, KL divergence between maps, or overlap statistics) is supplied to establish the severity of the conflict in baselines or its correlation with tracking accuracy.
  Authors: We acknowledge that the abstract does not include quantitative metrics to support the observation of conflicting attention maps. The manuscript provides qualitative visualizations and discussion in Section 3, but we agree that adding quantitative support would strengthen the motivation. In the revised version, we will incorporate metrics such as average inter-modal attention cosine similarity and its correlation with tracking accuracy (computed on RGB-T baselines) into the main text and add a concise reference in the abstract.
  Revision: yes
- Referee: [Abstract] The claim that AMG-LoRA plus HMoE resolves the alignment issue without introducing new trade-offs rests on high-level description; no ablation results, attention-map visualizations with before/after comparisons, or efficiency breakdowns (e.g., FLOPs or parameter counts per component) are referenced to demonstrate causality or the absence of hidden costs.
  Authors: We agree that the abstract would benefit from referencing the supporting analyses already present in the manuscript. Ablation studies (Section 4.3), before/after attention map visualizations (Figure 4), and efficiency breakdowns (Table 2 with parameter counts and FLOPs) demonstrate the contributions and lack of new trade-offs. In the revision, we will update the abstract to briefly reference these results (e.g., 'validated through ablations and efficiency analysis') to better substantiate the claims while keeping the abstract concise.
  Revision: yes
Circularity Check
No circularity detected; method proposal contains no derivations or self-referential reductions
The paper introduces SEATrack as an empirical architecture combining AMG-LoRA (LoRA plus adaptive mutual guidance for attention alignment) and HMoE (hierarchical mixture of experts for fusion). No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or prior self-citations. The central narrative rests on an observational claim about modality biases and conflicting attention maps, followed by proposed modules whose effectiveness is asserted via experimental results rather than algebraic equivalence. Standard PEFT and MoE building blocks are adapted without self-definitional loops or load-bearing uniqueness theorems imported from the same authors. This is a typical engineering contribution whose validity hinges on external benchmarks, not internal definitional closure.
Forward citations
Cited by 1 Pith paper
- Unified Multimodal Visual Tracking with Dual Mixture-of-Experts: OneTrackerV2 unifies multimodal tracking via Meta Merger and Dual Mixture-of-Experts to reach state-of-the-art results on five tasks and 12 benchmarks with efficiency and robustness when modalities are missing.