Active Adversarial Perturbation-driven Associative Memory Retrieval for RGB-Event Visual Object Tracking

Jin Tang; Lan Chen; Sibao Chen; Xiao Wang; Xufeng Lou; Yaowei Wang; Yonghong Tian; Zikang Yan

arXiv: 2606.26455 · v1 · pith:IKCBSLNMnew · submitted 2026-06-24 · 💻 cs.CV · cs.AI · cs.LG

Xiao Wang, Xufeng Lou, Zikang Yan, Lan Chen, Sibao Chen, Yaowei Wang, Yonghong Tian, Jin Tang

Pith reviewed 2026-06-26 01:10 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords RGB-Event tracking · adversarial perturbation · associative memory retrieval · visual object tracking · multi-modal fusion · Hopfield network · modal degradation

The pith

A framework trains RGB-Event trackers to stay accurate when one sensor fails or the target is only partly visible by simulating those degradations with adversarial perturbations and retrieving past features through calibrated memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the problem that RGB-Event trackers lose accuracy when one modality becomes unreliable or the target is occluded, truncated, or cluttered. It does so by building a training process that actively generates two kinds of structured degradations: full loss of one sensor and localized absence of the target region. A routing mechanism keeps the two degradation types from interfering during learning, while a retrieval module uses association footprints to pull reliable historical information only from target-related memory slots. If successful, trackers would maintain localization even in harsh conditions where conventional fusion methods collapse. The approach is demonstrated through experiments on four RGB-Event tracking benchmarks.

Core claim

APRTrack constructs structured degradation via two adversarial perturbation branches at the modality and spatial levels, which separately simulate full-modal failure and localized target region absence. A hierarchical routing mechanism disentangles the training pipelines of the two perturbation types. Footprint-guided Channel-calibrated Hopfield Retrieval evaluates retrieval confidence based on association footprints between queries and memory banks, calibrates the retrieval metric space prior to Hopfield matching, and realizes controllable historical feature compensation bounded to target regions.

What carries the argument

Hierarchical adversarial perturbation branches at modality and spatial levels plus Footprint-guided Channel-calibrated Hopfield Retrieval (FCHR) that bounds memory compensation to target regions using association footprints.
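
The retrieval step named above can be illustrated with a generic modern Hopfield update. This is a minimal sketch, not the paper's FCHR: the text here gives no equations, so `channel_scale` (a per-channel rescaling applied before matching), the max-weight confidence score, and `conf_threshold` are hypothetical stand-ins for the calibration and footprint-based confidence the paper describes. The sketch also omits the restriction of compensation to target regions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(query, memory, channel_scale, beta=4.0, conf_threshold=0.5):
    """One modern Hopfield retrieval step with two hypothetical add-ons.

    query:         list of floats, the current feature vector
    memory:        list of stored feature vectors (same length as query)
    channel_scale: hypothetical per-channel calibration weights
    """
    # Hypothetical calibration: rescale channels before similarity is measured.
    q = [c * x for c, x in zip(channel_scale, query)]
    keys = [[c * x for c, x in zip(channel_scale, slot)] for slot in memory]

    # Standard modern Hopfield update: softmax over scaled dot products,
    # then a weighted sum of the stored patterns.
    scores = [beta * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    retrieved = [
        sum(w * slot[d] for w, slot in zip(weights, memory))
        for d in range(len(query))
    ]

    # Hypothetical confidence: how strongly the association concentrates
    # on a single memory slot. Below the threshold, nothing is retrieved.
    confidence = max(weights)
    gate = confidence if confidence >= conf_threshold else 0.0

    # Gated compensation: blend retrieved history into the current feature.
    fused = [(1.0 - gate) * x + gate * r for x, r in zip(query, retrieved)]
    return fused, weights, gate
```

With a query close to one stored pattern the gate opens and the output moves toward that pattern; with an ambiguous query and a strict threshold the gate stays at zero and the query passes through unchanged.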

If this is right

Trackers retain localization when an entire modality drops out because the training explicitly forces the model to operate without it.
Partial target absence is handled by learning to ignore or compensate for missing spatial regions rather than relying on complete appearance.
Historical features are retrieved only when association footprints indicate high relevance, reducing contamination from background or earlier errors.
The two degradation types can be trained separately without forcing the network into a collapsed feature space.
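
The two degradation types, and the idea of routing each training sample to only one of them, can be sketched as follows. This is an illustration under stated assumptions: the paper's branches are adversarial, whereas this sketch swaps in plain random selection, and `drop_modality`, `mask_region`, and `route_degradation` are hypothetical names. Feature maps are nested lists to keep the example dependency-free.

```python
import random


def _zeros_like(grid):
    """Return a grid of zeros with the same shape as the input."""
    return [[0.0 for _ in row] for row in grid]


def drop_modality(rgb, event, which):
    """Simulate full failure of one modality by zeroing its feature map."""
    if which == "rgb":
        return _zeros_like(rgb), event
    if which == "event":
        return rgb, _zeros_like(event)
    raise ValueError("which must be 'rgb' or 'event'")


def mask_region(grid, top, left, height, width):
    """Simulate localized target absence by zeroing a rectangle."""
    return [
        [
            0.0 if (top <= r < top + height and left <= c < left + width) else v
            for c, v in enumerate(row)
        ]
        for r, row in enumerate(grid)
    ]


def route_degradation(rgb, event, box, rng):
    """Apply exactly one degradation type per sample.

    Hypothetical routing rule: pick a branch at random so the two
    constraints are never superimposed on the same sample.
    """
    branch = rng.choice(["modality", "spatial"])
    if branch == "modality":
        which = rng.choice(["rgb", "event"])
        rgb, event = drop_modality(rgb, event, which)
    else:
        top, left, height, width = box
        rgb = mask_region(rgb, top, left, height, width)
        event = mask_region(event, top, left, height, width)
    return branch, rgb, event
```

Passing a seeded `random.Random` instance as `rng` makes the choice of branch reproducible across runs.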

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same perturbation-plus-retrieval pattern could be tested on other multi-modal pairs such as RGB-infrared or RGB-LiDAR tracking.
The footprint calibration step might be replaced by learned attention masks if the Hopfield component is swapped for a transformer memory.
If the perturbations prove too specific, adding a third branch that simulates combined degradations could be checked for further gains.

Load-bearing premise

The adversarial perturbations created at training time produce degradations that closely match the sensor failures and partial target losses that actually occur in real RGB-Event videos.

What would settle it

Measure tracker accuracy on a set of real RGB-Event sequences containing modal failures or partial occlusions whose statistics differ from the two perturbation types used in training; a large drop relative to the reported results would falsify the transfer of robustness.

Figures

Figures reproduced from arXiv: 2606.26455 by Jin Tang, Lan Chen, Sibao Chen, Xiao Wang, Xufeng Lou, Yaowei Wang, Yonghong Tian, Zikang Yan.

**Figure 2.** An overview of the proposed APRTrack framework for missing-robust RGB-Event tracking. APRTrack first maps RGB and Event template-search inputs […]

**Figure 3.** Detailed architecture of (1) query-memory association footprint esti […]

**Figure 4.** Success rate comparison under 14 challenging attributes on FELT.

**Figure 5.** Compensation gate dynamics of FCHR during training.

**Figure 6.** Visualization of attention maps generated by APRTrack.

Original abstract

RGB-Event tracking improves localization robustness by fusing RGB appearance textures and dense temporal motion cues from event sensors. While this multi-modal scheme broadens tracking applicability, real-world scenes suffer diverse structured signal degradations that hinder traditional multi-modal fusion. In harsh environments, either modality can lose reliability drastically, and targets frequently appear incomplete due to occlusion, edge truncation, and foreground clutter. To tackle the above challenges, we present a hierarchical perturbation and retrieval framework tailored for RGB-Event tracking with robustness against partial target missing and modal degradation, termed APRTrack. To mimic real-world signal corruption, APRTrack constructs structured degradation via two adversarial perturbation branches at the modality and spatial levels, which separately simulate full-modal failure and localized target region absence. A hierarchical routing mechanism is designed to disentangle the training pipelines of the two perturbation types, effectively eliminating feature collapse induced by superimposed degradation constraints. Furthermore, we devise Footprint-guided Channel-calibrated Hopfield Retrieval (FCHR) for reliable historical information compensation. This module evaluates retrieval confidence based on association footprints between queries and memory banks, and calibrates the retrieval metric space prior to Hopfield matching, realizing controllable historical feature compensation bounded to target regions. Extensive experiments on the FE108, COESOT, VisEvent, and FELT datasets demonstrate the effectiveness of our proposed strategies for RGB-Event visual object tracking. The source code and pre-trained models will be released at https://github.com/Event-AHU/OpenEvTracking

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

APRTrack adds modality and spatial adversarial branches with hierarchical routing plus footprint-calibrated Hopfield retrieval to handle degradations in RGB-Event tracking, but whether those branches actually reproduce real sensor patterns remains the open question.

The paper's main advance is the APRTrack framework that trains RGB-Event trackers to handle modal dropouts and spatial missing via targeted adversarial perturbations, routed hierarchically, and compensated with calibrated Hopfield memory. The experiments look solid on the surface but the match between perturbations and real degradations is the part that needs scrutiny.

The combination of separate modality-level and spatial-level perturbation branches plus the hierarchical routing to prevent feature collapse is a concrete engineering step that has not appeared in exactly this form for event-based tracking. The footprint-guided calibration on the Hopfield retrieval is a useful detail that keeps the memory compensation tied to the target region rather than letting it drift.

The experiments run on FE108, COESOT, VisEvent, and FELT, which covers the usual benchmarks. Releasing code and models is the right move and makes the work easier to check.

The soft spot is the realism of the perturbations. The claim rests on them mimicking full-modal failure and localized target absence in a way that matches actual event-camera statistics such as sparse density or polarity imbalance. The abstract gives no histograms, KL numbers, or sensor-model comparisons against the real degraded sequences, so it is still possible the learned invariance is to synthetic artifacts. If the full paper contains those checks or ablations that isolate the perturbation design, the result strengthens; otherwise the robustness gain stays harder to trust.

This is a paper for people already working on multi-modal event tracking who need practical robustness tricks. A reader who cares about deployed systems in harsh conditions will find the architecture and the dataset results useful. It is coherent on its own terms and shows clear engagement with the failure modes, so it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes APRTrack, a hierarchical perturbation and retrieval framework for RGB-Event visual object tracking. It constructs structured degradations via two adversarial perturbation branches (modality-level and spatial-level) to simulate full-modal failure and localized target absence, introduces a hierarchical routing mechanism to disentangle training pipelines and avoid feature collapse, and devises Footprint-guided Channel-calibrated Hopfield Retrieval (FCHR) to enable controllable historical feature compensation. Effectiveness is demonstrated via experiments on the FE108, COESOT, VisEvent, and FELT datasets, with code and models to be released.

Significance. If the adversarial perturbations are validated to match the statistics and structure of real RGB-Event degradations (rather than introducing non-physical artifacts) and the routing/FCHR components deliver measurable robustness gains under partial target missing and modal degradation, the work could advance reliable multi-modal tracking in adverse conditions. The explicit commitment to releasing source code and pre-trained models is a clear strength for reproducibility.

major comments (2)

[Abstract] The central robustness claim rests on the modality-level and spatial-level adversarial branches 'mimicking real-world signal corruption' by simulating full-modal failure and localized target absence. However, no quantitative validation is provided (e.g., KL divergence, event-rate histograms, polarity imbalance statistics, or comparison against real degraded sequences from FE108/COESOT) to confirm that the generated perturbations match the structure of actual event-camera noise, occlusion, or truncation rather than arbitrary adversarial patterns. This is load-bearing for whether the learned invariance transfers to real data.
[Method (hierarchical routing)] The hierarchical routing mechanism is presented as eliminating feature collapse from superimposed degradation constraints, yet the manuscript supplies no ablation or analysis (e.g., feature similarity metrics or training dynamics) showing that the disentanglement is necessary or effective. Without such evidence, it is unclear whether the reported gains on the four datasets are attributable to this component or to other factors.
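
The kind of check the first major comment asks for can be made concrete with a small sketch. Assuming per-window event rates have been measured for real degraded sequences and for perturbed training samples (the measurement itself is not shown, and the bin edges are arbitrary), one could compare their histograms with a smoothed KL divergence. The function names here are hypothetical.

```python
import math


def histogram(values, edges):
    """Count values into bins [edges[i], edges[i+1]).

    The last bin also includes its right edge. Values outside the
    overall range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) in nats between two histograms over the same bins.

    Additive smoothing with eps keeps empty bins from producing
    log(0) or division by zero.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p = [c + eps for c in p_counts]
    q = [c + eps for c in q_counts]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (a / p_total) * math.log((a / p_total) / (b / q_total))
        for a, b in zip(p, q)
    )
```

A value near zero would indicate that the perturbed samples have event-rate statistics close to the real degraded ones under this particular binning; it says nothing about spatial structure or polarity, which would need their own comparisons.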

minor comments (2)

[Abstract] The phrase 'extensive experiments ... demonstrate the effectiveness of our proposed strategies' is vague; specific metrics (e.g., success rate or precision gains over baselines) should be summarized to allow readers to gauge the magnitude of improvement.
[Method (FCHR)] The FCHR module description refers to 'association footprints' and 'calibrates the retrieval metric space' without defining these terms or providing the corresponding equations; adding a short notation table or explicit formulas would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below, acknowledging where additional evidence is needed and outlining specific revisions to the manuscript.

Referee: [Abstract] The central robustness claim rests on the modality-level and spatial-level adversarial branches 'mimicking real-world signal corruption' by simulating full-modal failure and localized target absence. However, no quantitative validation is provided (e.g., KL divergence, event-rate histograms, polarity imbalance statistics, or comparison against real degraded sequences from FE108/COESOT) to confirm that the generated perturbations match the structure of actual event-camera noise, occlusion, or truncation rather than arbitrary adversarial patterns. This is load-bearing for whether the learned invariance transfers to real data.

Authors: We agree that explicit quantitative validation of the generated perturbations against real RGB-Event degradations would strengthen the claim that the simulated corruptions support transfer to real data. The current manuscript motivates the branches via domain-specific design (full-modal dropout for sensor failure and spatial masking for occlusion/truncation) but does not include direct statistical comparisons. In the revision we will add a new subsection (likely 4.3) reporting event-rate histograms, polarity imbalance statistics, and KL-divergence measurements between the adversarial outputs and real degraded sequences drawn from FE108 and COESOT. These results will be used to support or refine the perturbation parameters.

Revision: yes

Referee: [Method (hierarchical routing)] The hierarchical routing mechanism is presented as eliminating feature collapse from superimposed degradation constraints, yet the manuscript supplies no ablation or analysis (e.g., feature similarity metrics or training dynamics) showing that the disentanglement is necessary or effective. Without such evidence, it is unclear whether the reported gains on the four datasets are attributable to this component or to other factors.

Authors: We acknowledge that the manuscript currently states the purpose of the hierarchical routing without accompanying ablation or training-dynamics analysis. To demonstrate necessity, the revised version will include an ablation study (new Table or Figure in Section 4) that reports (i) cosine similarity between modality-level and spatial-level feature embeddings with and without routing, and (ii) training loss curves and final tracking metrics when the routing module is removed. This will clarify the contribution of the disentanglement step to the overall gains.

Revision: yes
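
Item (i) of the proposed ablation reduces to a simple computation. A minimal sketch, assuming each sample yields one embedding from the modality-level branch and one from the spatial-level branch; `mean_paired_cosine` is a hypothetical helper, and how the embeddings are extracted is left open.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector has zero norm.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_paired_cosine(modality_feats, spatial_feats):
    """Average cosine similarity over paired embeddings, one pair per sample."""
    if not modality_feats or len(modality_feats) != len(spatial_feats):
        raise ValueError("need two non-empty lists of equal length")
    sims = [cosine_similarity(a, b) for a, b in zip(modality_feats, spatial_feats)]
    return sum(sims) / len(sims)
```

Running it once on embeddings from a model trained with routing and once on a model trained without gives the two numbers the authors propose to compare.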

Circularity Check

0 steps flagged

No circularity: empirical engineering framework with no load-bearing derivations or self-referential predictions

The paper presents APRTrack as a proposed architecture consisting of adversarial perturbation branches, hierarchical routing, and FCHR retrieval for RGB-Event tracking. No equations, fitted parameters, or first-principles derivations are described in the provided text that would reduce any claimed robustness or performance gain to a tautology or self-definition. The method is introduced as an empirical contribution validated through experiments on FE108, COESOT, VisEvent, and FELT datasets. No self-citations are invoked as load-bearing uniqueness theorems, and no predictions are made that are statistically forced by construction from inputs. This is a standard non-circular engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented physical entities; the FCHR module is an algorithmic construct rather than a postulated entity with independent evidence.

pith-pipeline@v0.9.1-grok · 5818 in / 1098 out tokens · 22458 ms · 2026-06-26T01:10:46.116370+00:00 · methodology

Reference graph

Works this paper leans on

82 extracted references · 3 linked inside Pith

[1]

Transformer tracking,

X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8126–8135

2021
[2]

Mixformer: End-to-end tracking with iterative mixed attention,

Y . Cui, C. Jiang, L. Wang, and G. Wu, “Mixformer: End-to-end tracking with iterative mixed attention,” inProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, 2022, pp. 13 608– 13 618

2022
[3]

Joint feature learning and relation modeling for tracking: A one-stream framework,

B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” inEuropean conference on computer vision. Springer, 2022, pp. 341– 357

2022
[4]

Seqtrack: Sequence to sequence learning for visual object tracking,

X. Chen, H. Peng, D. Wang, H. Lu, and H. Hu, “Seqtrack: Sequence to sequence learning for visual object tracking,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 14 572–14 581

2023
[5]

Event- based vision: A survey,

G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidiset al., “Event- based vision: A survey,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 1, pp. 154–180, 2020

2020
[6]

Event-guided structured output tracking of fast-moving objects using a celex sensor,

J. Huang, S. Wang, M. Guo, and S. Chen, “Event-guided structured output tracking of fast-moving objects using a celex sensor,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2413–2417, 2018

2018
[7]

Object tracking by jointly exploiting frame and event domain,

J. Zhang, X. Yang, Y . Fu, X. Wei, B. Yin, and B. Dong, “Object tracking by jointly exploiting frame and event domain,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13 043–13 052

2021
[8]

Visevent: Reliable object tracking via collaboration of frame and event flows,

X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y . Wang, Y . Tian, and F. Wu, “Visevent: Reliable object tracking via collaboration of frame and event flows,”IEEE Transactions on Cybernetics, vol. 54, no. 3, pp. 1997–2010, 2023

1997
[9]

Revisiting color-event based tracking: A unified network, dataset, and metric,

C. Tang, X. Wang, J. Huang, B. Jiang, L. Zhu, S. Chen, J. Zhang, Y . Wang, and Y . Tian, “Revisiting color-event based tracking: A unified network, dataset, and metric,”Pattern Recognition, p. 112718, 2025

2025
[10]

Long-term frame-event visual tracking: Benchmark dataset and baseline,

X. Wang, J. Huang, S. Wang, C. Tang, B. Jiang, Y . Tian, J. Tang, and B. Luo, “Long-term frame-event visual tracking: Benchmark dataset and baseline,”arXiv e-prints, pp. arXiv–2403, 2024

2024
[11]

Frame- event alignment and fusion network for high frame rate tracking,

J. Zhang, Y . Wang, W. Liu, M. Li, J. Bai, B. Yin, and X. Yang, “Frame- event alignment and fusion network for high frame rate tracking,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 9781–9790

2023
[12]

Cross-modal orthogonal high-rank augmentation for rgb-event transformer-trackers,

Z. Zhu, J. Hou, and D. O. Wu, “Cross-modal orthogonal high-rank augmentation for rgb-event transformer-trackers,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 045–22 055

2023
[13]

Visual prompt multi- modal tracking,

J. Zhu, S. Lai, X. Chen, D. Wang, and H. Lu, “Visual prompt multi- modal tracking,” inProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, 2023, pp. 9516–9526

2023
[14]

Distractor-aware event-based tracking,

Y . Fu, M. Li, W. Liu, Y . Wang, J. Zhang, B. Yin, X. Wei, and X. Yang, “Distractor-aware event-based tracking,”IEEE Transactions on Image Processing, vol. 32, pp. 6129–6141, 2023

2023
[15]

Revisiting motion information for rgb-event tracking with mot philosophy,

T. Zhang, K. Debattista, Q. Zhang, G. Ding, and J. Han, “Revisiting motion information for rgb-event tracking with mot philosophy,” in Advances in Neural Information Processing Systems, vol. 37, 2024

2024
[16]

Exploring historical information for rgbe visual tracking with mamba,

C. Sun, J. Zhang, Y . Wang, H. Ge, Q. Xia, B. Yin, and X. Yang, “Exploring historical information for rgbe visual tracking with mamba,” inProceedings of the Computer Vision and Pattern Recognition Confer- ence, 2025, pp. 6500–6509

2025
[17]

Hopfield networks is all you need,

H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler, L. Gruber, M. Holzleitner, M. Pavlovi ´c, G. K. Sandveet al., “Hopfield networks is all you need,”arXiv preprint arXiv:2008.02217, 2020

Pith/arXiv arXiv 2008
[18]

Event stream-based visual object tracking: A high-resolution bench- mark dataset and a novel baseline,

X. Wang, S. Wang, C. Tang, L. Zhu, B. Jiang, Y . Tian, and J. Tang, “Event stream-based visual object tracking: A high-resolution bench- mark dataset and a novel baseline,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 248–19 257

2024
[19]

Single-model and any-modality for video object tracking,

Z. Wu, J. Zheng, X. Ren, F.-A. Vasluianu, C. Ma, D. P. Paudel, L. Van Gool, and R. Timofte, “Single-model and any-modality for video object tracking,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 19 156–19 166

2024
[20]

Sutrack: Towards simple and unified single object tracking,

X. Chen, B. Kang, W. Geng, J. Zhu, Y . Liu, D. Wang, and H. Lu, “Sutrack: Towards simple and unified single object tracking,”arXiv preprint arXiv:2412.19138, 2024

arXiv 2024
[21]

Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking,

X. Hou, J. Xing, Y . Qian, Y . Guo, S. Xin, J. Chen, K. Tang, M. Wang, Z. Jiang, L. Liuet al., “Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 551–26 561

2024
[22]

Xtrack: Multimodal training boosts rgb- x video object trackers,

Y . Tan, Z. Wu, Y . Fu, Z. Zhou, G. Sun, E. Zamfi, C. Ma, D. P. Paudel, L. Van Gool, and R. Timofte, “Xtrack: Multimodal training boosts rgb- x video object trackers,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 5734–5744

2025
[23]

Mamba-fetrack v2: Revisiting state space model for frame- event based visual object tracking,

S. Wang, J. Huang, Q. Ma, J. Gao, C. Xu, X. Wang, L. Chen, and B. Jiang, “Mamba-fetrack v2: Revisiting state space model for frame- event based visual object tracking,”arXiv preprint arXiv:2506.23783, 2025

arXiv 2025
[24]

Missing modality imagination network for emotion recognition with uncertain missing modalities,

J. Zhao, R. Li, and Q. Jin, “Missing modality imagination network for emotion recognition with uncertain missing modalities,” inProceedings IEEE TRANSACTIONS ON ***, 2026 13 of the AAAI Conference on Artificial Intelligence, vol. 35, no. 6, 2021, pp. 5680–5688

2026
[25]

Smil: Multimodal learning with severely missing modality,

M. Ma, J. Ren, L. Zhao, S. Tulyakov, C. Wu, and X. Peng, “Smil: Multimodal learning with severely missing modality,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2302–2310

2021
[26]

Multimodal prompt- ing with missing modalities for visual recognition,

Y .-L. Lee, Y .-H. Tsai, W.-C. Chiu, and C.-Y . Lee, “Multimodal prompt- ing with missing modalities for visual recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 943–14 952

2023
[27]

Multi-modal learning with missing modality via shared-specific feature modelling,

H. Wang, Y . Chen, C. Ma, J. Avery, L. Hull, and G. Carneiro, “Multi-modal learning with missing modality via shared-specific feature modelling,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 878–15 887

2023
[28]

Re- covering coherent affective patterns: Addressing modality missing in multimodal sentiment analysis,

H. Huang, T. Gong, K. He, W. Wen, W. Zhang, and M. Feng, “Re- covering coherent affective patterns: Addressing modality missing in multimodal sentiment analysis,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 26, 2026, pp. 21 957–21 965

2026
[29]

Rag4dmc: Retrieval-augmented generation for data-level modality completion,

N. He, Y . Deng, S. Yue, Y . Fu, Z. Zhang, and T. Gao, “Rag4dmc: Retrieval-augmented generation for data-level modality completion,” in International Conference on Learning Representations, 2026

2026
[30]

Mora: Missing modality low-rank adaptation for visual recognition,

S. Zhao, N. Ahuja, T. Yu, T. Shen, and V . Narayanan, “Mora: Missing modality low-rank adaptation for visual recognition,” inInternational Conference on Learning Representations, 2026

2026
[31]

Miss-reid: Delivering robust multi-modality object re- identification despite missing modalities,

R. Xi, “Miss-reid: Delivering robust multi-modality object re- identification despite missing modalities,” inAdvances in Neural In- formation Processing Systems, vol. 38, 2025

2025
[32]

Inference-time dynamic modality selection for incomplete multimodal classification,

S. Du, X. Luo, D. P. O’Regan, and C. Qin, “Inference-time dynamic modality selection for incomplete multimodal classification,” inInter- national Conference on Learning Representations, 2026

2026
[33]

Transformer meets tracker: Exploiting temporal context for robust visual tracking,

N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: Exploiting temporal context for robust visual tracking,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1571–1580

2021
[34]

Learning target candidate association to keep track of what not to track,

C. Mayer, M. Danelljan, D. P. Paudel, and L. Van Gool, “Learning target candidate association to keep track of what not to track,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 444–13 454

2021
[35]

Hiptrack: Visual tracking with historical prompts,

W. Cai, Q. Liu, and Y . Wang, “Hiptrack: Visual tracking with historical prompts,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 258–19 267

2024
[36]

Odtrack: Online dense temporal token learning for visual tracking,

Y . Zheng, B. Zhong, Q. Liang, Z. Mo, S. Zhang, and X. Li, “Odtrack: Online dense temporal token learning for visual tracking,” inProceed- ings of the AAAI conference on artificial intelligence, vol. 38, no. 7, 2024, pp. 7588–7596

2024
[37]

Autoregressive queries for adaptive tracking with spatio-temporal trans- formers,

J. Xie, B. Zhong, Z. Mo, S. Zhang, L. Shi, S. Song, and R. Ji, “Autoregressive queries for adaptive tracking with spatio-temporal trans- formers,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 300–19 309

2024
[38]

Exploring enhanced contextual information for video-level object tracking,

B. Kang, X. Chen, S. Lai, Y . Liu, Y . Liu, and D. Wang, “Exploring enhanced contextual information for video-level object tracking,” in Proceedings of the AAAI conference on Artificial Intelligence, vol. 39, no. 4, 2025, pp. 4194–4202

2025
[39]

Universal hopfield networks: A general framework for single-shot associative memory models,

B. Millidge, T. Salvatori, Y . Song, T. Lukasiewicz, and R. Bogacz, “Universal hopfield networks: A general framework for single-shot associative memory models,” inInternational Conference on Machine Learning. PMLR, 2022, pp. 15 561–15 583

2022
[40]

Adaptive hopfield network: Rethinking similarities in associative memory,

S. Wang, Y . Pan, Z. Shen, M. Zhang, H. Wang, and G. Li, “Adaptive hopfield network: Rethinking similarities in associative memory,” in International Conference on Learning Representations, 2026

2026
[41]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

Pith/arXiv arXiv 2010
[42]

Cloob: Modern hopfield networks with infoloob outperform clip,

A. Fürst, E. Rumetshofer, J. Lehner, V . T. Tran, F. Tang, H. Ramsauer, D. Kreil, M. Kopp, G. Klambauer, A. Bittoet al., “Cloob: Modern hopfield networks with infoloob outperform clip,”Advances in neural information processing systems, vol. 35, pp. 20 450–20 468, 2022

2022
[43]

Outlier-efficient hopfield layers for large transformer-based models,

J. Y .-C. Hu, P.-H. Chang, R. Luo, H.-Y . Chen, W. Li, W.-P. Wang, and H. Liu, “Outlier-efficient hopfield layers for large transformer-based models,”arXiv preprint arXiv:2404.03828, 2024

arXiv 2024
[44]

Beyond scaling laws: Understand- ing transformer performance with associative memory,

X. Niu, B. Bai, L. Deng, and W. Han, “Beyond scaling laws: Understand- ing transformer performance with associative memory,”arXiv preprint arXiv:2405.08707, 2024

arXiv 2024
[45]

Exploiting memory-aware q-distribution prediction for nuclear fusion via modern hopfield network,

Q. Ma, S. Wang, T. Zheng, X. Dai, Y . Wang, Q. Yang, and X. Wang, “Exploiting memory-aware q-distribution prediction for nuclear fusion via modern hopfield network,” inInternational Conference on Brain Inspired Cognitive Systems. Springer, 2024, pp. 104–114

2024
[46]

Conformal prediction for time series with modern hopfield networks,

A. Auer, M. Gauch, D. Klotz, and S. Hochreiter, “Conformal prediction for time series with modern hopfield networks,”Advances in neural information processing systems, vol. 36, pp. 56 027–56 074, 2023

2023
[47]

Stanhop: Sparse tandem hopfield model for memory-enhanced time series prediction,

Y .-H. Wu, J. Y .-C. Hu, W. Li, B.-Y . Chen, and H. Liu, “Stanhop: Sparse tandem hopfield model for memory-enhanced time series prediction,” in International Conference on Learning Representations, vol. 2024, 2024, pp. 30 886–30 925

2024
[48]

Unsupervised domain adaptation by back- propagation,

Y . Ganin and V . Lempitsky, “Unsupervised domain adaptation by back- propagation,”Proceedings of the International Conference on Machine Learning, pp. 1180–1189, 2015

2015
[49]

Categorical reparameterization with gumbel-softmax,

E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,”International Conference on Learning Representa- tions, 2017

2017
[50]

The concrete distribution: A continuous relaxation of discrete random variables,

C. J. Maddison, A. Mnih, and Y . W. Teh, “The concrete distribution: A continuous relaxation of discrete random variables,”International Conference on Learning Representations, 2017

2017
[51]

H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 658–666

[52]

H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 734–750

[53]

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

[54]

X. Zhang, Y. Tian, L. Xie, W. Huang, Q. Dai, Q. Ye, and Q. Tian, “Hivit: A simpler and more efficient design of hierarchical vision transformer,” in The eleventh international conference on learning representations, 2023

[55]

Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji, “Siamese box adaptive network for visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6668–6677

[56]

Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu, “Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 12 549–12 556

[57]

G. Bhat, M. Danelljan, L. Van Gool, and R. Timofte, “Know your surroundings: Exploiting scene information for object tracking,” in European conference on computer vision. Springer, 2020, pp. 205–221

[58]

X. Dong, J. Shen, L. Shao, and F. Porikli, “Clnet: A compact latent network for fast adjusting siamese trackers,” in European conference on computer vision. Springer, 2020, pp. 378–395

[59]

M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “Atom: Accurate tracking by overlap maximization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4660–4669

[60]

G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte, “Learning discriminative model prediction for tracking,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6182–6191

[61]

M. Danelljan, L. V. Gool, and R. Timofte, “Probabilistic regression for visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 7183–7192

[62]

B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “Learning spatio-temporal transformer for visual tracking,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 448–10 457

[63]

S. Gao, C. Zhou, C. Ma, X. Wang, and J. Yuan, “Aiatrack: Attention in attention for transformer visual tracking,” in European Conference on Computer Vision. Springer, 2022, pp. 146–164

[64]

C. Mayer, M. Danelljan, G. Bhat, M. Paul, D. P. Paudel, F. Yu, and L. Van Gool, “Transforming model prediction for tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 8731–8740

[65]

T. Zhang, Q. Zhang, K. Debattista, and J. Han, “Cross-modality distillation for multi-modal tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

[66]

C. Xu, B. Zhong, Q. Liang, Y. Zheng, G. Li, and S. Song, “Less is more: Token context-aware learning for object tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8824–8832

[67]

J. Yang, L. Fan, J. Zhang, X. Lian, H. Shen, and D. Hu, “Fully spiking neural networks for unified frame-event object tracking,” vol. 38, 2026, pp. 121 132–121 163

[68]

H. Wu, X. Wang, J. Zhang, J. Tong, X. Chen, J. Lin, Y. Ma, and X. Shen, “Utptrack: Towards simple and unified token pruning for visual tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 20 963–20 972

[69]

Z. Wang, S. Liu, H. Zheng, S. Wang, Y. Hu, H. Fan, Y. Li, H. Guo, and L. Deng, “Lastracker: A lightweight rgb-e tracking framework with ann-snn adaptive switching,” Pattern Recognition, p. 113623, 2026

[70]

D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen, “Siamcar: Siamese fully convolutional classification and regression for visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6269–6277

[71]

P. Voigtlaender, J. Luiten, P. H. Torr, and B. Leibe, “Siam r-cnn: Visual tracking by re-detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6578–6588

[72]

J. Su, Z. Xue, S. Zhang, K. Chen, W. Hu, and Z. Zhang, “Seatrack: Simple, efficient, and adaptive multimodal tracker,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 28 679–28 689

[73]

B. Chen, P. Li, L. Bai, L. Qiao, Q. Shen, B. Li, W. Gan, W. Wu, and W. Ouyang, “Backbone is all your need: A simplified architecture for visual object tracking,” in European conference on computer vision. Springer, 2022, pp. 375–392

[74]

S. Gao, C. Zhou, and J. Zhang, “Generalized relation modeling for transformer tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18 686–18 695

[75]

Y. Cai, J. Liu, J. Tang, and G. Wu, “Robust object modeling for visual tracking,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 9589–9600

[76]

Y. Bai, Z. Zhao, Y. Gong, and X. Wei, “Artrackv2: Prompting autoregressive tracker where to look and how to describe,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 19 048–19 057

[77]

L. Shi, B. Zhong, Q. Liang, N. Li, S. Zhang, and X. Li, “Explicit visual prompts for visual object tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4838–4846

[78]

J. Zheng, M. Liang, S. Huang, and J. Ning, “Exploring the feature extraction and relation modeling for light-weight transformer tracking,” in European Conference on Computer Vision. Springer, 2024, pp. 110–126

[79]

J. Zhu, H. Tang, X. Chen, X. Wang, D. Wang, and H. Lu, “Two-stream beats one-stream: asymmetric siamese network for efficient visual tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 10, 2025, pp. 10 959–10 967

[80]

Y. Wu, X. Wang, X. Yang, M. Liu, D. Zeng, H. Ye, and S. Li, “Learning occlusion-robust vision transformers for real-time uav tracking,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 17 103–17 113
