SENTRY: SAM2-Enhanced Neighbor-Aware and Temporally Reasoned Memory for Visual Tracking

Hasan AlMarzouqi; Mohamad Alansari; Muzammal Naseer; Naoufel Werghi; Sajid Javed; Yonathan Michael

arxiv: 2606.24449 · v2 · pith:LL2XGNTFnew · submitted 2026-06-23 · 💻 cs.CV

Mohamad Alansari, Yonathan Michael, Hasan AlMarzouqi, Muzammal Naseer, Naoufel Werghi, Sajid Javed

Pith reviewed 2026-06-26 00:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords: visual object tracking, SAM2, memory update, temporal consistency, cycle-consistent matching, zero-shot tracking, plug-and-play module, drift reduction

The pith

Replacing confidence-only memory writes with neighbor-aware temporal consistency checks stabilizes SAM2-based visual trackers without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies confidence-only mask selection as the main source of drift in SAM2 trackers during occlusion, rapid motion, and distractors. SENTRY inserts a training-free step before each memory write that generates multiple segmentation hypotheses per frame, forms short tracklets, and validates them through neighbor-aware cycle-consistent matching against recent trajectories. This plug-and-play replacement of the write rule improves results when added to existing models. A sympathetic reader would care because the change requires no model updates yet produces gains across many datasets while preserving speed.

Core claim

SENTRY is a refine-before-write module that aggregates diverse segmentation hypotheses, backtracks them into short tracklets, and uses neighbor-aware cycle-consistent matching to enforce short-horizon temporal and geometric consistency before committing any mask to memory, replacing confidence-driven writes in unmodified SAM2 architectures and yielding consistent gains across nine benchmarks with new zero-shot state-of-the-art results on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi.

What carries the argument

The SENTRY refine-before-write module that validates memory updates via neighbor-aware cycle-consistent matching on short tracklets formed from multiple per-frame segmentation hypotheses.
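
A minimal sketch of the refine-before-write idea follows. It is an illustration under stated assumptions, not the paper's procedure: boxes stand in for masks, mean per-frame IoU stands in for the consistency score, and a brute-force one-to-one assignment stands in for the neighbor-aware cycle-consistent matching. The names `iou`, `affinity`, and `select_write`, and all numbers, are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def affinity(tracklet, trajectory):
    """Mean per-frame IoU over a short, equal-length window (assumed score)."""
    return sum(iou(a, b) for a, b in zip(tracklet, trajectory)) / len(trajectory)

def select_write(candidates, target, neighbors):
    """Index of the candidate tracklet assigned to the target trajectory, or None."""
    trajectories = [target] + neighbors  # column 0 is the target
    n = max(len(candidates), len(trajectories))
    aff = [[0.0] * n for _ in range(n)]  # zero-padded square affinity matrix
    for i, cand in enumerate(candidates):
        for j, traj in enumerate(trajectories):
            aff[i][j] = affinity(cand, traj)
    # Brute-force one-to-one assignment; only sensible for a handful of candidates.
    best = max(permutations(range(n)),
               key=lambda p: sum(aff[i][p[i]] for i in range(n)))
    row = best.index(0)  # row matched to the target column
    return row if row < len(candidates) and aff[row][0] > 0 else None

# Hypothetical 3-frame window: the target moves right, a neighbor stays put.
target = [(0, 0, 10, 10), (2, 0, 12, 10), (4, 0, 14, 10)]
neighbor = [(30, 0, 40, 10)] * 3
cand_on_distractor = [(30, 0, 40, 10), (30, 0, 40, 10), (31, 0, 41, 10)]
cand_on_target = [(0, 0, 10, 10), (2, 0, 12, 10), (5, 0, 15, 10)]
candidates = [cand_on_distractor, cand_on_target]
confidences = [0.95, 0.80]  # the distractor hypothesis is the more confident one

by_confidence = max(range(len(candidates)), key=lambda i: confidences[i])
by_consistency = select_write(candidates, target, [neighbor])
```

In this toy case the confidence-only rule picks the hypothesis sitting on the neighbor, while the assignment hands that hypothesis to the neighbor trajectory and leaves the target slot to the temporally consistent one. Modeling the neighbor explicitly is what lets the distractor hypothesis be "explained away" rather than merely down-weighted.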

If this is right

Consistent gains appear when SENTRY is added to five strong SAM2 baselines across nine benchmarks.
New zero-shot state-of-the-art results are reached on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi.
The SAM2-L version runs at 32.8 FPS on an A100, and SENTRY adds only about 0.4-0.6 GB of VRAM across compatible hosts.
The first unified all-scale evaluation of SAM2-based trackers is provided.
Enforcing temporal validity at write time stabilizes memory-augmented tracking without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same write-time validation pattern could apply to memory mechanisms in other video models that rely on stored features or masks.
Testing longer tracklet horizons or additional geometric constraints would show whether the current short-horizon choice is optimal or merely sufficient.
Similar consistency checks might reduce drift in related tasks such as video object segmentation or multi-object tracking.
Many memory-augmented systems may benefit more from stricter write rules than from later correction stages.

Load-bearing premise

Short-horizon temporal consistency measured by neighbor-aware cycle-consistent matching is a sufficient proxy for identifying correct segmentation masks under occlusion, rapid motion, and distractors.

What would settle it

Integrating SENTRY into the five evaluated baselines and measuring no consistent improvement on the nine benchmarks would show that the temporal validation step does not stabilize tracking.

Figures

Figures reproduced from arXiv: 2606.24449 by Hasan AlMarzouqi, Mohamad Alansari, Muzammal Naseer, Naoufel Werghi, Sajid Javed, Yonathan Michael.

**Figure 1.** SENTRY performs favorably against SAM2-based variants across major tracking benchmarks. We denote SENTRY applied to SAM2, SAMURAI, and DAM4SAM as SENTRY-S2, SENTRY-SR, and SENTRY-D4S. (a) AUC per benchmark; (b) average AUC across [22, 23, 47, 60].

**Figure 2.** Qualitative comparison under occlusion, abrupt motion, and distractor interference. We visualize decoder attention to highlight spatial focus during tracking. (a) SAM2 [51] loses target identity after occlusion, (b) SAMURAI [71] drifts under rapid motion, and (c) DAM4SAM [58] misidentifies distractors due to heuristic filtering. The last row shows SENTRY maintaining accurate localization and consistent ma…

**Figure 3.** Overview of the SENTRY framework. Per-frame candidates combine Automatic Mask Generation (AMG) proposals (A–B) and decoder hypotheses (D) into a joint set (C). Each candidate is evaluated through cycle-consistent, neighbor-aware bipartite matching in trajectory space (E–F). The most consistent mask is written to memory (G–H) following the baseline update schedule.

**Figure 4.** AUC of attributes on (a) LaSOT [23], (b) LaSOT_ext [22], and (c) TNL2K [60].

**Figure 5.** Acc-Rob plot on (a) VOT20 [32], (b) VOT22 [31], (c) VOTS24 [33], and (d) DiDi [58] for SAM2 and SENTRY variants. The Q is given at each label.

**Figure 6.** Qualitative comparison of baselines and SENTRY variants: (1) SAM2 vs. SENTRY-S2, (2) SAMURAI vs. SENTRY-SR, (3) DAM4SAM vs. SENTRY-D4S. (4) Failure case: all methods lose the target after a long out-of-view period (100 frames) with a tiny target.

**Figure 7.** Our proposed SENTRY performs favorably against SAM2-based variants across different model scales on the LaSOT [23] benchmark.

Original abstract

We revisit the memory update mechanism in SAM2-based visual object tracking and identify confidence-only mask selection as the dominant cause of drift under occlusion, rapid motion, and distractors. We introduce SENTRY, a training-free, plug-and-play, refine-before-write module that validates each memory update for short-horizon temporal consistency before committing it. SENTRY aggregates diverse segmentation hypotheses per frame, backtracks them into short tracklets, and uses neighbor-aware cycle-consistent matching against recent trajectories to favor temporally and geometrically consistent masks. It leaves the base architecture untouched, replacing confidence-driven writes with consistency-validated ones. For fair evaluation, we re-evaluate major open-source SAM2-based trackers across all available scales and datasets, filling gaps in prior reports. Integrated into five strong baselines, SENTRY delivers consistent gains across nine benchmarks, achieving new zero-shot SOTA on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi. Despite these checks, the SAM2-L version runs at 32.8 FPS on an A100, and across compatible hosts adds only about 0.4-0.6 GB VRAM. Our results provide the first unified all-scale evaluation of SAM2-based trackers and show that enforcing temporal validity at write time stabilizes memory-augmented tracking without retraining. Project page: https://hamadya.github.io/SENTRY/page/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SENTRY adds a training-free consistency check to SAM2 memory writes and reports steady benchmark gains, but the mechanism's ability to reject coherent distractors remains the key untested assumption.

The new piece is the refine-before-write step: it pulls multiple segmentation hypotheses per frame, builds short tracklets by backtracking, then scores them with neighbor-aware cycle-consistent matching against recent trajectories. This replaces the usual confidence-driven memory commit in SAM2 trackers. They plug it into five existing baselines and show consistent lifts on nine datasets, including new zero-shot numbers on LaSOT, GOT-10k, and the VOT sets. They also re-ran the open-source baselines across scales to fill in missing numbers, which makes the comparisons more credible.

The practical upside is clear: no retraining, small speed and memory cost, and the gains appear across different base models. That kind of drop-in fix is the sort of thing tracking people actually try.

The soft spot is exactly the one the stress-test note flags. The method assumes short-horizon temporal and geometric consistency will reliably prefer the true mask over a drifted or distractor mask that happens to look coherent in the same window. If motion patterns align or all hypotheses inherit the same error, the check can still write bad data. The abstract gives no ablations on hard occlusion or rapid distractor cases, and no statistical tests on the reported deltas, so the load-bearing claim rests on the benchmark tables alone.

This is aimed at researchers who already use or extend SAM2 for video tracking. It is solid enough on the empirical side to deserve referee time, even if the mechanism needs tighter validation in review.

Referee Report

3 major / 2 minor

Summary. The paper claims that confidence-only mask selection in SAM2-based trackers causes drift under occlusion, rapid motion, and distractors; SENTRY is a training-free, plug-and-play module that aggregates segmentation hypotheses, forms short tracklets, and applies neighbor-aware cycle-consistent matching to validate temporal and geometric consistency before memory writes. Integrated into five baselines in place of confidence-driven updates, these validated writes yield consistent gains across nine benchmarks and new zero-shot SOTA on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi, at small overhead (32.8 FPS for SAM2-L on an A100 and about 0.4-0.6 GB of added VRAM across compatible hosts). The paper also provides the first unified all-scale evaluation of SAM2-based trackers.

Significance. If the empirical results hold, the work demonstrates that enforcing short-horizon temporal validity at write time can stabilize memory-augmented tracking without retraining or architectural changes, offering a lightweight, generalizable improvement for foundation-model trackers. The fair re-evaluation of open-source baselines across scales and the reporting of runtime/VRAM metrics are concrete strengths that aid reproducibility and comparison in the field.

major comments (3)

[§3] §3 (SENTRY mechanism): The central claim that neighbor-aware cycle-consistent matching reliably selects correct masks rests on the untested assumption that short-horizon consistency is a sufficient proxy under occlusion/rapid motion/distractors; no ablation, failure-case analysis, or quantitative comparison of scores for true vs. drifted but temporally coherent hypotheses is provided to substantiate this load-bearing step.
[§4] Experimental evaluation (throughout §4 and tables): Gains and new SOTA claims are reported after re-evaluating baselines, but the manuscript does not specify the exact dataset splits, versions, or statistical significance tests used; without these, the reported improvements cannot be independently verified and the cross-benchmark consistency claim is difficult to assess.
[§3.2] §3.2 (cycle-consistent matching): The aggregation of hypotheses into tracklets and scoring procedure is described at a high level, but lacks explicit equations or pseudocode for the neighbor-aware component; this makes it impossible to determine whether the method can assign high consistency scores when all hypotheses share a common error pattern (e.g., aligned distractor motion).
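
One way the requested comparison of scores for correct versus drifted-but-coherent hypotheses could be quantified is a rank-based separation measure (AUROC). The sketch below is illustrative only: the function name `auroc` and the score lists are hypothetical and are not taken from the paper.

```python
def auroc(correct_scores, drifted_scores):
    """Probability that a correct hypothesis outscores a drifted one (ties count 0.5)."""
    wins = sum((c > d) + 0.5 * (c == d)
               for c in correct_scores
               for d in drifted_scores)
    return wins / (len(correct_scores) * len(drifted_scores))

# Hypothetical consistency scores at memory-write time.
correct = [0.9, 0.8, 0.7]    # hypotheses that overlap the ground truth
drifted = [0.85, 0.4, 0.3]   # temporally coherent hypotheses on a distractor

separation = auroc(correct, drifted)  # 7 of 9 pairs are ranked correctly
```

A value near 1.0 would indicate that the consistency score separates the two groups; a value near 0.5 would indicate that coherent distractors are scored like the true target, which is exactly the failure mode the comment raises.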

minor comments (2)

[Abstract] Abstract and §4: The phrase 'new zero-shot SOTA' should be accompanied by explicit comparison tables showing the previous best scores and the exact margins achieved.
[§4] Figure captions and §4: Several result tables lack error bars or run-to-run variance, which would help contextualize the reported gains.
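
For the variance and significance concerns above, a paired bootstrap over per-sequence scores is one lightweight option. This is a sketch under assumptions: the per-sequence AUC values are invented for illustration, and `paired_bootstrap_ci` is a hypothetical helper, not something the paper reports using.

```python
import random

def paired_bootstrap_ci(baseline, variant, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-sequence gain (variant - baseline) with a percentile bootstrap CI."""
    assert len(baseline) == len(variant)
    deltas = [v - b for b, v in zip(baseline, variant)]
    n = len(deltas)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample sequences with replacement, keeping each pair together.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lo, hi

# Hypothetical per-sequence AUC for a baseline and the same baseline with SENTRY.
baseline_auc = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.72, 0.50]
sentry_auc = [0.63, 0.56, 0.71, 0.52, 0.66, 0.61, 0.73, 0.53]

mean_gain, ci_lo, ci_hi = paired_bootstrap_ci(baseline_auc, sentry_auc)
```

If the interval excludes zero, the gain is unlikely to be an artifact of which sequences happen to be in the benchmark. Pairing matters here because per-sequence difficulty varies far more than the method-to-method difference.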

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment below with clarifications and indicate the revisions planned for the manuscript.

Referee: [§3] §3 (SENTRY mechanism): The central claim that neighbor-aware cycle-consistent matching reliably selects correct masks rests on the untested assumption that short-horizon consistency is a sufficient proxy under occlusion/rapid motion/distractors; no ablation, failure-case analysis, or quantitative comparison of scores for true vs. drifted but temporally coherent hypotheses is provided to substantiate this load-bearing step.

Authors: We acknowledge that the manuscript does not contain a dedicated quantitative comparison of consistency scores between correct and drifted-but-coherent hypotheses, nor dedicated failure-case analysis on this specific point. The cross-benchmark gains provide supporting evidence, but to directly address the concern we will add an ablation study and selected failure-case visualizations in the revised version.

revision: yes

Referee: [§4] Experimental evaluation (throughout §4 and tables): Gains and new SOTA claims are reported after re-evaluating baselines, but the manuscript does not specify the exact dataset splits, versions, or statistical significance tests used; without these, the reported improvements cannot be independently verified and the cross-benchmark consistency claim is difficult to assess.

Authors: All evaluations followed the official benchmark splits and dataset versions released by the respective organizers. We will explicitly document these versions and splits in the revised manuscript. Statistical significance testing is uncommon in the tracking literature; we will add a note on result consistency across the nine benchmarks but do not plan to introduce new statistical tests unless required.

revision: partial

Referee: [§3.2] §3.2 (cycle-consistent matching): The aggregation of hypotheses into tracklets and scoring procedure is described at a high level, but lacks explicit equations or pseudocode for the neighbor-aware component; this makes it impossible to determine whether the method can assign high consistency scores when all hypotheses share a common error pattern (e.g., aligned distractor motion).

Authors: We agree that the neighbor-aware component would benefit from a more formal description. The revised manuscript will include explicit equations and pseudocode for the full cycle-consistent matching procedure, together with a brief discussion of robustness to shared error patterns such as aligned distractor motion.

revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic module evaluated on external benchmarks

The paper introduces SENTRY as a training-free plug-and-play module that replaces confidence-based memory writes with neighbor-aware cycle-consistent matching for short-horizon temporal validation. All reported gains are empirical results from integrating the module into existing baselines and measuring performance on independent external benchmarks (LaSOT, GOT-10k, VOT20/22, etc.). No equations, fitted parameters, or self-citation chains are present that would reduce any claimed result to an input by construction; the central contribution is an algorithmic filter whose validity is tested outside the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on a domain assumption about the reliability of short-horizon consistency rather than new mathematical axioms or fitted parameters.

axioms (1)

domain assumption: Short-horizon temporal consistency via neighbor-aware cycle-consistent matching reliably identifies correct masks under occlusion and distractors.
This premise underpins the entire refine-before-write logic described in the abstract.

Reference graph

Works this paper leans on

78 extracted references · 16 canonical work pages

[1] Alansari, M., Abdul Hay, O., Alansari, S., Javed, S., Shoufan, A., Zweiri, Y., Werghi, N.: Drone-person tracking in uniform appearance crowd: A new dataset. Scientific Data 11(1), 15 (2024)
[2] Alansari, M., Javed, S., Ganapathi, I.I., Alansari, S., Naseer, M.: Cldtracker: A comprehensive language description for visual tracking. Information Fusion 124, 103374 (2025). https://doi.org/10.1016/j.inffus.2025.103374, https://www.sciencedirect.com/science/article/pii/S1566253525004476
[3] Asanomi, T., Nishimura, K., Bise, R.: Multi-frame attention with feature-level warping for drone crowd tracking. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1664–1673 (2023). https://doi.org/10.1109/WACV56688.2023.00171
[4] Bhat, G., Lawin, F.J., Danelljan, M., Robinson, A., Felsberg, M., Van Gool, L., Timofte, R.: Learning what to learn for video object segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 777–794. Springer International Publishing, Cham (2020)
[5] Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-nms – improving object detection with one line of code. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
[6] Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., Soatto, S.: Memot: Multi-object tracking with memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8090–8100 (June 2022)
[7] Cai, W., Liu, Q., Wang, Y.: Hiptrack: Visual tracking with historical prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19258–19267 (June 2024)
[8] Cai, W., Liu, Q., Wang, Y.: Spmtrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16871–16881 (June 2025)
[9] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Coll-Vinent, D.S., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., HAZRA, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng... In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=r35clVtGzw
[10] Chen, R., Sun, G., Li, Y., Qin, J., Benini, L.: Him2sam: Enhancing sam2 with hierarchical motion estimation and memory optimization towards long-term tracking. In: Kittler, J., Xiong, H., Yang, J., Chen, X., Lu, J., Lin, W., Yu, J., Zheng, W. (eds.) Pattern Recognition and Computer Vision. pp. 276–291. Springer Nature Singapore, Singapore (2026)
[11] Chen, X., Peng, H., Wang, D., Lu, H., Hu, H.: Seqtrack: Sequence to sequence learning for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14572–14581 (June 2023)
[12] Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8126–8135 (2021)
[13] Chen, Y., Xu, J., Yu, J., Wang, Q., Yoo, B., Han, J.J.: Afod: Adaptive focused discriminative segmentation tracker. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. pp. 666–682. Springer International Publishing, Cham (2020)
[14] Chen, Y.H., Wang, C.Y., Yang, C.Y., Chang, H.S., Lin, Y.L., Chuang, Y.Y., Liao, H.Y.M.: Neighbortrack: Single object tracking by bipartite matching with neighbor tracklets and its applications to sports. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 5139–5148 (2023). https://doi.org/10.1109/CVPRW59228.2023.00542
[15] Cheng, H.K., Oh, S.W., Price, B., Lee, J.Y., Schwing, A.: Putting the object back into video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3151–3161 (June 2024)
[16] Cheng, H.K., Schwing, A.G.: Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 640–658. Springer Nature Switzerland, Cham (2022)
[17] Cheng, H.K., Tai, Y.W., Tang, C.K.: Rethinking space-time networks with improved memory coverage for efficient video object segmentation. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 11781–11794. Curran Associates, Inc. (2021), https://proceedings.neurips....
[18] Cui, Y., Jiang, C., Wang, L., Wu, G.: Mixformer: End-to-end tracking with iterative mixed attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13608–13618 (2022)
[19] Cui, Y., Jiang, C., Wu, G., Wang, L.: Mixformer: End-to-end tracking with iterative mixed attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(6), 4129–4146 (2024). https://doi.org/10.1109/TPAMI.2024.3349519
[20] Ding, H., Liu, C., He, S., Jiang, X., Torr, P.H., Bai, S.: Mose: A new dataset for video object segmentation in complex scenes. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 20167–20177 (2023). https://doi.org/10.1109/ICCV51070.2023.01850
[21] Ding, S., Qian, R., Dong, X., Zhang, P., Zang, Y., Cao, Y., Guo, Y., Lin, D., Wang, J.: Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. arXiv preprint arXiv:2410.16268 (2024)
[22] Fan, H., Bai, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Harshit, Huang, M., Liu, J., et al.: Lasot: A high-quality large-scale single object tracking benchmark. International Journal of Computer Vision 129(2), 439–461 (2021)
[23] Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Ling, H.: Lasot: A high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
[24] Feng, X., Li, X., Hu, S., Zhang, D., Wu, M., Zhang, J., Chen, X., Huang, K.: Memvlt: Vision-language tracking with adaptive memory-based prompts. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 14903–14933. Curran Associates, Inc. (2024), https:/...
[25] Fu, Z., Liu, Q., Fu, Z., Wang, Y.: Stmtrack: Template-free visual tracking with space-time memory networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13774–13783 (June 2021)
[26] Hong, L., Chen, W., Liu, Z., Zhang, W., Guo, P., Chen, Z., Zhang, W.: Lvos: A benchmark for long-term video object segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13480–13492 (October 2023)
[27] Hong, L., Liu, Z., Chen, W., Tan, C., Feng, Y., Zhou, X., Guo, P., Li, J., Chen, Z., Gao, S., Zhang, W., Zhang, W.: Lvos: A benchmark for large-scale long-term video object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 48(1), 946–961 (2026). https://doi.org/10.1109/TPAMI.2025.3611020
[28] Huang, L., Zhao, X., Huang, K.: Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(5), 1562–1577 (2021)
[29] Huang, Y., Li, X., Zhou, Z., Wang, Y., He, Z., Yang, M.H.: Rtracker: Recoverable tracking via pn tree structured memory. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19038–19047 (2024). https://doi.org/10.1109/CVPR52733.2024.01801
[30] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October 2023)
[31] Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Kämäräinen, J.K., Chang, H.J., Danelljan, M., Zajc, L.Č., Lukežič, A., Drbohlav, O., Björklund, J., Zhang, Y., Zhang, Z., Yan, S., Yang, W., Cai, D., Mayer, C., Fernández, G., Ben, K., Bhat, G., Chang, H., Chen, G., Chen, J., Chen, S., Chen, X., Chen, X., Chen, X., Chen, Y., Chen, Y.H.,... (2022)
[32] Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Kämäräinen, J.K., Danelljan, M., Zajc, L.Č., Lukežič, A., Drbohlav, O., He, L., Zhang, Y., Yan, S., Yang, J., Fernández, G., Hauptmann, A., Memarmoghadam, A., García-Martín, Á., Robinson, A., Varfolomieiev, A., Gebrehiwot, A.H., Uzun, B., Yan, B., Li, B., Qian, C., Tsai, C.Y., Micheloni... In: Bartoli, A., Fusiello, A... (2020)
[33] Kristan, M., Matas, J., Tokmakov, P., Felsberg, M., Zajc, L.Č., Lukežič, A., Tran, K.T., Vu, X.S., Björklund, J., Chang, H.J., Fernández, G., Attari, M., Chan, A., Chen, L., Chen, X., Collins, J., Cui, Y., Devarapu, G.S.M., Du, Y., Fan, H., Fan, W.C., Feng, Z., Gao, M., Gorthi, R.K.S., Goyal, R., Han, J., Hatuwal, B., He, Z., Hu, X., Huang, X., Huang, Y.,... In: Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T... (2024)
[34] Kristan, M., Matas, J., Danelljan, M., Felsberg, M., Chang, H.J., Čehovin Zajc, L., Lukežič, A., Drbohlav, O., Zhang, Z., Tran, K.T., Vu, X.S., Björklund, J., Mayer, C., Zhang, Y., Ke, L., Zhao, J., Fernández, G., Al-Shakarji, N., An, D., Arens, M., Becker, S., Bhat, G., Bullinger, S., Chan, A.B., Chang, S., Chen, H., Chen, X., Chen, Y., Chen, Z., Cheng, ... In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) (2023). https://doi.org/10.1109/ICCVW60793.2023.00195
[35]

In: Proceedingsof the IEEE/CVFConference on ComputerVision and Pattern Recognition (CVPR) Workshops (June 2020)

Li, C., Yang, T., Zhu, S., Chen, C., Guan, S.: Density map guided object detection in aerial images. In: Proceedingsof the IEEE/CVFConference on ComputerVision and Pattern Recognition (CVPR) Workshops (June 2020)

2020
[36]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Li, X., Zhong, B., Liang, Q., Mo, Z., Nong, J., Song, S.: Dynamic updates for language adaptation in visual-language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19165– 19174 (June 2025)

2025
[37]

In: The Thirteenth International Conferenceon LearningRepresentations(2025),https://openreview.net/forum? id=EM93t94zEi

Li, X., Miao, D., He, Z., Wang, Y., Lu, H., Yang, M.H.: Learning spatial-semantic features for robust video object segmentation. In: The Thirteenth International Conferenceon LearningRepresentations(2025),https://openreview.net/forum? id=EM93t94zEi

2025
[38]

In: 2025 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR)

Liang, S., Bai, Y., Gong, Y., Wei, X.: Autoregressive sequential pretraining for visual tracking. In: 2025 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR). pp. 7254–7264 (2025).https://doi.org/10.1109/ CVPR52734.2025.00680

arXiv 2025
[39] Lin, L., Fan, H., Zhang, Z., Wang, Y., Xu, Y., Ling, H.: Tracking meets lora: Faster training, larger model, stronger performance. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 300–318. Springer Nature Switzerland, Cham (2025)
[40] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 38–55. Springer Nature Switzerland,...
[41] Liu, X., Zhou, L., Zhou, Z., Chen, J., He, Z.: Mambavlt: Time-evolving multimodal state space model for vision-language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8731–8741 (June 2025)
[42] Lukežič, A., Matas, J., Kristan, M.: A discriminative single-shot segmentation network for visual object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(12), 9742–9755 (2022). https://doi.org/10.1109/TPAMI.2021.3137933
[43] Ma, Y., Tang, Y., Yang, W., Zhang, T., Zhang, J., Kang, M.: Unifying visual and vision-language tracking via contrastive learning. Proceedings of the AAAI Conference on Artificial Intelligence 38(5), 4107–4116 (Mar 2024). https://doi.org/10.1609/aaai.v38i5.28205, https://ojs.aaai.org/index.php/AAAI/article/view/28205
[44] Ma, Z., Wang, L., Zhang, H., Lu, W., Yin, J.: Rpt: Learning point set representation for siamese visual tracking. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. pp. 653–665. Springer International Publishing, Cham (2020)
[45] Mayer, C., Danelljan, M., Paudel, D.P., Van Gool, L.: Learning target candidate association to keep track of what not to track. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13444–13454 (October 2021)
[46] Meethal, A., Granger, E., Pedersoli, M.: Cascaded zoom-in detector for high resolution aerial images. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 2046–2055 (2023). https://doi.org/10.1109/CVPRW59228.2023.00198
[47] Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European Conference on Computer Vision (ECCV) (September 2018)
[48] Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5(1), 32–38 (1957). https://doi.org/10.1137/0105003
[49] Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
[50] Qin, H., Xu, T., Li, T., Chen, Z., Feng, T., Li, J.: Must: The first dataset and unified framework for multispectral uav single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16882–16891 (June 2025)
[51] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollar, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos. In: The Thirteenth International Conference on Learning Representations (2025)
[52] Ren, W., Kang, D., Tang, Y., Chan, A.B.: Fusing crowd density maps and visual object trackers for people tracking in crowd scenes. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5353–5362 (2018). https://doi.org/10.1109/CVPR.2018.00561
[53] Ryali, C., Hu, Y.T., Bolya, D., Wei, C., Fan, H., Huang, P.Y., Aggarwal, V., Chowdhury, A., Poursaeed, O., Hoffman, J., Malik, J., Li, Y., Feichtenhofer, C.: Hiera: A hierarchical vision transformer without the bells-and-whistles. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th Internationa... (2023)
[54] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 618–626 (2017). https://doi.org/10.1109/ICCV.2017.74
[55] Seong, H., Hyun, J., Kim, E.: Kernelized memory network for video object segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 629–645. Springer International Publishing, Cham (2020)
[56] Shao, Y., He, S., Ye, Q., Feng, Y., Luo, W., Chen, J.: Context-aware integration of language and visual references for natural language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19208–19217 (June 2024)
[57] Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20993–21002 (June 2022)
[58] Videnovic, J., Lukezic, A., Kristan, M.: A distractor-aware memory for visual object tracking with sam2. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 24255–24264 (June 2025)
[59] Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.C.: Feelvos: Fast end-to-end embedding learning for video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
[60] Wang, X., Shu, X., Zhang, Z., Jiang, B., Wang, Y., Tian, Y., Wu, F.: Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13763–13773 (2021)
[61] Wen, L., Du, D., Zhu, P., Hu, Q., Wang, Q., Bo, L., Lyu, S.: Detection, tracking, and counting meets drones in crowds: A benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7812–7821 (June 2021)
[62] Wu, Y., Wang, X., Yang, X., Liu, M., Zeng, D., Ye, H., Li, S.: Learning occlusion-robust vision transformers for real-time uav tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 17103–17113 (June 2025)
[63] Xie, F., Wang, Z., Ma, C.: Diffusiontrack: Point set diffusion model for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19113–19124 (June 2024)
[64] Xie, J., Zhong, B., Mo, Z., Zhang, S., Shi, L., Song, S., Ji, R.: Autoregressive queries for adaptive tracking with spatio-temporal transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19300–19309 (June 2024)
[65] Xiong, Y., Zhou, C., Xiang, X., Wu, L., Zhu, C., Liu, Z., Suri, S., Varadarajan, B., Akula, R., Iandola, F., Krishnamoorthi, R., Soran, B., Chandra, V.: Efficient track anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11513–11524 (October 2025)
[66] Xu, Q., Zhu, L., Liu, C., Lin, G., Long, C., Li, Z., Zhao, R.: Samite: Position prompted sam2 with calibrated memory for visual object tracking. arXiv preprint arXiv:2507.21732 (2025)
[67] Xue, C., Zhong, B., Liang, Q., Zheng, Y., Li, N., Xue, Y., Song, S.: Similarity-guided layer-adaptive vision transformer for uav tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6730–6740 (June 2025)
[68] Xun, Z., Di, S., Gao, Y., Tang, Z., Wang, G., Liu, S., Li, B.: Linker: Learning long short-term associations for robust visual tracking. IEEE Transactions on Multimedia 26, 6228–6237 (2024). https://doi.org/10.1109/TMM.2023.3347644
[69] Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10448–10457 (2021)
[70] Yan, B., Zhang, X., Wang, D., Lu, H., Yang, X.: Alpha-refine: Boosting tracking performance by precise bounding box estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5289–5298 (June 2021)
[71] Yang, C.Y., Huang, H.W., Chai, W., Jiang, Z., Hwang, J.N.: Samurai: Motion-aware memory for training-free visual object tracking with sam 2. IEEE Transactions on Image Processing 35, 970–982 (2026). https://doi.org/10.1109/TIP.2026.3651835
[72] Yang, Z., Wei, Y., Yang, Y.: Associating objects with transformers for video object segmentation. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 2491–. Curran Associates, Inc. (2021), https://proceedings.neurips.cc/paper_files/paper/2021/file/147702db07145348245dc5a2f2fe5683-Paper.pdf
[75] Yang, Z., Yang, Y.: Decoupling features in hierarchical propagation for video object segmentation. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 36324–36336. Curran Associates, Inc. (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/...
[76] Ye, B., Chang, H., Ma, B., Shan, S., Chen, X.: Joint feature learning and relation modeling for tracking: A one-stream framework. In: European conference on computer vision. pp. 341–357. Springer (2022)
[77] Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W.: Ocean: Object-aware anchor-free tracking. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 771–787. Springer International Publishing, Cham (2020)
[78] Zheng, Y., Zhong, B., Liang, Q., Mo, Z., Zhang, S., Li, X.: Odtrack: Online dense temporal token learning for visual tracking. Proceedings of the AAAI Conference on Artificial Intelligence 38(7), 7588–7596 (Mar 2024). https://doi.org/10.1609/aaai.v38i7.28591, https://ojs.aaai.org/index.php/AAAI/article/view/28591
[79] Zhou, J., Pang, Z., Wang, Y.X.: Rmem: Restricted memory banks improve video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18602–18611 (June 2024)

Appendix
A Additional Ablation Study
B Memory-Write Diagnostic Analysis ...

[1] Alansari, M., Abdul Hay, O., Alansari, S., Javed, S., Shoufan, A., Zweiri, Y., Werghi, N.: Drone-person tracking in uniform appearance crowd: A new dataset. Scientific Data 11(1), 15 (2024)
[2] Alansari, M., Javed, S., Ganapathi, I.I., Alansari, S., Naseer, M.: Cldtracker: A comprehensive language description for visual tracking. Information Fusion 124, 103374 (2025). https://doi.org/10.1016/j.inffus.2025.103374, https://www.sciencedirect.com/science/article/pii/S1566253525004476
[3] Asanomi, T., Nishimura, K., Bise, R.: Multi-frame attention with feature-level warping for drone crowd tracking. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1664–1673 (2023). https://doi.org/10.1109/WACV56688.2023.00171
[4] Bhat, G., Lawin, F.J., Danelljan, M., Robinson, A., Felsberg, M., Van Gool, L., Timofte, R.: Learning what to learn for video object segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 777–794. Springer International Publishing, Cham (2020)
[5] Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-nms – improving object detection with one line of code. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
[6] Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., Soatto, S.: Memot: Multi-object tracking with memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8090–8100 (June 2022)
[7] Cai, W., Liu, Q., Wang, Y.: Hiptrack: Visual tracking with historical prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19258–19267 (June 2024)
[8] Cai, W., Liu, Q., Wang, Y.: Spmtrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16871–16881 (June 2025)
[9] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Coll-Vinent, D.S., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., HAZRA, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng... In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=r35clVtGzw
[10] Chen, R., Sun, G., Li, Y., Qin, J., Benini, L.: Him2sam: Enhancing sam2 with hierarchical motion estimation and memory optimization towards long-term tracking. In: Kittler, J., Xiong, H., Yang, J., Chen, X., Lu, J., Lin, W., Yu, J., Zheng, W. (eds.) Pattern Recognition and Computer Vision. pp. 276–291. Springer Nature Singapore, Singapore (2026)
[11] Chen, X., Peng, H., Wang, D., Lu, H., Hu, H.: Seqtrack: Sequence to sequence learning for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14572–14581 (June 2023)
[12] Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8126–8135 (2021)
[13] Chen, Y., Xu, J., Yu, J., Wang, Q., Yoo, B., Han, J.J.: Afod: Adaptive focused discriminative segmentation tracker. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. pp. 666–682. Springer International Publishing, Cham (2020)
[14] Chen, Y.H., Wang, C.Y., Yang, C.Y., Chang, H.S., Lin, Y.L., Chuang, Y.Y., Liao, H.Y.M.: Neighbortrack: Single object tracking by bipartite matching with neighbor tracklets and its applications to sports. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 5139–5148 (2023). https://doi.org/10.1109/CVPRW59228.2023.00542
[15] Cheng, H.K., Oh, S.W., Price, B., Lee, J.Y., Schwing, A.: Putting the object back into video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3151–3161 (June 2024)
[16] Cheng, H.K., Schwing, A.G.: Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 640–658. Springer Nature Switzerland, Cham (2022)
[17] Cheng, H.K., Tai, Y.W., Tang, C.K.: Rethinking space-time networks with improved memory coverage for efficient video object segmentation. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 11781–11794. Curran Associates, Inc. (2021), https://proceedings.neurips....
[18] Cui, Y., Jiang, C., Wang, L., Wu, G.: Mixformer: End-to-end tracking with iterative mixed attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13608–13618 (2022)
[19] Cui, Y., Jiang, C., Wu, G., Wang, L.: Mixformer: End-to-end tracking with iterative mixed attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(6), 4129–4146 (2024). https://doi.org/10.1109/TPAMI.2024.3349519
[20] Ding, H., Liu, C., He, S., Jiang, X., Torr, P.H., Bai, S.: Mose: A new dataset for video object segmentation in complex scenes. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 20167–20177 (2023). https://doi.org/10.1109/ICCV51070.2023.01850
[21] Ding, S., Qian, R., Dong, X., Zhang, P., Zang, Y., Cao, Y., Guo, Y., Lin, D., Wang, J.: Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. arXiv preprint arXiv:2410.16268 (2024)
[22] Fan, H., Bai, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Harshit, Huang, M., Liu, J., et al.: Lasot: A high-quality large-scale single object tracking benchmark. International Journal of Computer Vision 129(2), 439–461 (2021)
[23] Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Ling, H.: Lasot: A high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
[24] Feng, X., Li, X., Hu, S., Zhang, D., Wu, M., Zhang, J., Chen, X., Huang, K.: Memvlt: Vision-language tracking with adaptive memory-based prompts. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 14903–14933. Curran Associates, Inc. (2024), https:/...
[25] Fu, Z., Liu, Q., Fu, Z., Wang, Y.: Stmtrack: Template-free visual tracking with space-time memory networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13774–13783 (June 2021)
[26] Hong, L., Chen, W., Liu, Z., Zhang, W., Guo, P., Chen, Z., Zhang, W.: Lvos: A benchmark for long-term video object segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13480–13492 (October 2023)
[27] Hong, L., Liu, Z., Chen, W., Tan, C., Feng, Y., Zhou, X., Guo, P., Li, J., Chen, Z., Gao, S., Zhang, W., Zhang, W.: Lvos: A benchmark for large-scale long-term video object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 48(1), 946–961 (2026). https://doi.org/10.1109/TPAMI.2025.3611020
[28] Huang, L., Zhao, X., Huang, K.: Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(5), 1562–1577 (2021)
[29] Huang, Y., Li, X., Zhou, Z., Wang, Y., He, Z., Yang, M.H.: Rtracker: Recoverable tracking via pn tree structured memory. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19038–19047 (2024). https://doi.org/10.1109/CVPR52733.2024.01801
[30] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October 2023)
[31] Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Kämäräinen, J.K., Chang, H.J., Danelljan, M., Zajc, L.Č., Lukežič, A., Drbohlav, O., Björklund, J., Zhang, Y., Zhang, Z., Yan, S., Yang, W., Cai, D., Mayer, C., Fernández, G., Ben, K., Bhat, G., Chang, H., Chen, G., Chen, J., Chen, S., Chen, X., Chen, X., Chen, X., Chen, Y., Chen, Y.H.,... (2022)
[32] Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Kämäräinen, J.K., Danelljan, M., Zajc, L.Č., Lukežič, A., Drbohlav, O., He, L., Zhang, Y., Yan, S., Yang, J., Fernández, G., Hauptmann, A., Memarmoghadam, A., García-Martín, Á., Robinson, A., Varfolomieiev, A., Gebrehiwot, A.H., Uzun, B., Yan, B., Li, B., Qian, C., Tsai, C.Y., Micheloni... In: Bartoli, A., Fusiello, A. ... (2020)
[33] Kristan, M., Matas, J., Tokmakov, P., Felsberg, M., Zajc, L.Č., Lukežič, A., Tran, K.T., Vu, X.S., Björklund, J., Chang, H.J., Fernández, G., Attari, M., Chan, A., Chen, L., Chen, X., Collins, J., Cui, Y., Devarapu, G.S.M., Du, Y., Fan, H., Fan, W.C., Feng, Z., Gao, M., Gorthi, R.K.S., Goyal, R., Han, J., Hatuwal, B., He, Z., Hu, X., Huang, X., Huang, Y.,... In: Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T. ... (2024)
[34] [34]

In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Kristan, M., Matas, J., Danelljan, M., Felsberg, M., Chang, H.J., Čehovin Zajc, L., Lukežič, A., Drbohlav, O., Zhang, Z., Tran, K.T., Vu, X.S., Björklund, J., Mayer, C., Zhang, Y., Ke, L., Zhao, J., Fernández, G., Al-Shakarji, N., An, D., Arens, M., Becker, S., Bhat, G., Bullinger, S., Chan, A.B., Chang, S., Chen, H., Chen, X., Chen, Y., Chen, Z., Cheng, ...

work page doi:10.1109/iccvw60793.2023.00195 2023

[35] [35]

In: Proceedingsof the IEEE/CVFConference on ComputerVision and Pattern Recognition (CVPR) Workshops (June 2020)

Li, C., Yang, T., Zhu, S., Chen, C., Guan, S.: Density map guided object detection in aerial images. In: Proceedingsof the IEEE/CVFConference on ComputerVision and Pattern Recognition (CVPR) Workshops (June 2020)

2020

[36] [36]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Li, X., Zhong, B., Liang, Q., Mo, Z., Nong, J., Song, S.: Dynamic updates for language adaptation in visual-language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19165– 19174 (June 2025)

2025

[37] [37]

In: The Thirteenth International Conferenceon LearningRepresentations(2025),https://openreview.net/forum? id=EM93t94zEi

Li, X., Miao, D., He, Z., Wang, Y., Lu, H., Yang, M.H.: Learning spatial-semantic features for robust video object segmentation. In: The Thirteenth International Conferenceon LearningRepresentations(2025),https://openreview.net/forum? id=EM93t94zEi

2025

[38] [38]

In: 2025 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR)

Liang, S., Bai, Y., Gong, Y., Wei, X.: Autoregressive sequential pretraining for visual tracking. In: 2025 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR). pp. 7254–7264 (2025).https://doi.org/10.1109/ CVPR52734.2025.00680

arXiv 2025

[39] [39]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Lin, L., Fan, H., Zhang, Z., Wang, Y., Xu, Y., Ling, H.: Tracking meets lora: Faster training, larger model, stronger performance. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 300–318. Springer Nature Switzerland, Cham (2025)

2024

[40] [40]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 38–55. Springer Nature Switzerland,...

2024

[41] [41]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Liu, X., Zhou, L., Zhou, Z., Chen, J., He, Z.: Mambavlt: Time-evolving multimodal state space model for vision-language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8731–8741 (June 2025)

2025

[42] [42]

2024.3445770

Lukežič, A., Matas, J., Kristan, M.: A discriminative single-shot segmentation net- work for visual object tracking. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence44(12), 9742–9755 (2022).https://doi.org/10.1109/TPAMI. 2021.3137933

work page doi:10.1109/tpami 2022

[43] [43]

Ma, Y., Tang, Y., Yang, W., Zhang, T., Zhang, J., Kang, M.: Unifying vi- sual and vision-language tracking via contrastive learning. Proceedings of the AAAI Conference on Artificial Intelligence38(5), 4107–4116 (Mar 2024).https: //doi.org/10.1609/aaai.v38i5.28205,https://ojs.aaai.org/index.php/ AAAI/article/view/28205

work page doi:10.1609/aaai.v38i5.28205 2024

[44] [44]

In: Bartoli, A., Fusiello, A

Ma, Z., Wang, L., Zhang, H., Lu, W., Yin, J.: Rpt: Learning point set representa- tion for siamese visual tracking. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision 20 M. Alansari, Y. Michael et al. – ECCV 2020 Workshops. pp. 653–665. Springer International Publishing, Cham (2020)

2020

[45] [45]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Mayer, C., Danelljan, M., Paudel, D.P., Van Gool, L.: Learning target candidate association to keep track of what not to track. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13444–13454 (October 2021)

2021

[46] [46]

Uncrtaints: Uncertainty quantification for cloud removal in optical satellite time series,

Meethal, A., Granger, E., Pedersoli, M.: Cascaded zoom-in detector for high resolution aerial images. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 2046–2055 (2023).https: //doi.org/10.1109/CVPRW59228.2023.00198

work page doi:10.1109/cvprw59228.2023.00198 2023

[47] [47]

In: Proceedings of the European Conference on Computer Vision (ECCV) (September 2018)

Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: Trackingnet: A large- scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European Conference on Computer Vision (ECCV) (September 2018)

2018

[48] [48]

Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5(1), 32–38 (1957). https://doi.org/10.1137/0105003

[49] [49]

Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)

[50] [50]

Qin, H., Xu, T., Li, T., Chen, Z., Feng, T., Li, J.: Must: The first dataset and unified framework for multispectral uav single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16882–16891 (June 2025)

[51] [51]

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollar, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos. In: The Thirteenth International Conference on Learning Representations (2025)

[52] [52]

Ren, W., Kang, D., Tang, Y., Chan, A.B.: Fusing crowd density maps and visual object trackers for people tracking in crowd scenes. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5353–5362 (2018). https://doi.org/10.1109/CVPR.2018.00561

[53] [53]

Ryali, C., Hu, Y.T., Bolya, D., Wei, C., Fan, H., Huang, P.Y., Aggarwal, V., Chowdhury, A., Poursaeed, O., Hoffman, J., Malik, J., Li, Y., Feichtenhofer, C.: Hiera: A hierarchical vision transformer without the bells-and-whistles. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th Internationa...

2023

[54] [54]

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 618–626 (2017). https://doi.org/10.1109/ICCV.2017.74

[55] [55]

Seong, H., Hyun, J., Kim, E.: Kernelized memory network for video object segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 629–645. Springer International Publishing, Cham (2020)

[56] [56]

Shao, Y., He, S., Ye, Q., Feng, Y., Luo, W., Chen, J.: Context-aware integration of language and visual references for natural language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19208–19217 (June 2024)

[57] [57]

Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20993–21002 (June 2022)

[58] [58]

Videnovic, J., Lukezic, A., Kristan, M.: A distractor-aware memory for visual object tracking with sam2. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 24255–24264 (June 2025)

[59] [59]

Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.C.: Feelvos: Fast end-to-end embedding learning for video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

[60] [60]

Wang, X., Shu, X., Zhang, Z., Jiang, B., Wang, Y., Tian, Y., Wu, F.: Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13763–13773 (2021)

[61] [61]

Wen, L., Du, D., Zhu, P., Hu, Q., Wang, Q., Bo, L., Lyu, S.: Detection, tracking, and counting meets drones in crowds: A benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7812–7821 (June 2021)

[62] [62]

Wu, Y., Wang, X., Yang, X., Liu, M., Zeng, D., Ye, H., Li, S.: Learning occlusion-robust vision transformers for real-time uav tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 17103–17113 (June 2025)

[63] [63]

Xie, F., Wang, Z., Ma, C.: Diffusiontrack: Point set diffusion model for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19113–19124 (June 2024)

[64] [64]

Xie, J., Zhong, B., Mo, Z., Zhang, S., Shi, L., Song, S., Ji, R.: Autoregressive queries for adaptive tracking with spatio-temporal transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19300–19309 (June 2024)

[65] [65]

Xiong, Y., Zhou, C., Xiang, X., Wu, L., Zhu, C., Liu, Z., Suri, S., Varadarajan, B., Akula, R., Iandola, F., Krishnamoorthi, R., Soran, B., Chandra, V.: Efficient track anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11513–11524 (October 2025)

[66] [66]

Xu, Q., Zhu, L., Liu, C., Lin, G., Long, C., Li, Z., Zhao, R.: Samite: Position prompted sam2 with calibrated memory for visual object tracking. arXiv preprint arXiv:2507.21732 (2025)

[67] [67]

Xue, C., Zhong, B., Liang, Q., Zheng, Y., Li, N., Xue, Y., Song, S.: Similarity-guided layer-adaptive vision transformer for uav tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6730–6740 (June 2025)

[68] [68]

Xun, Z., Di, S., Gao, Y., Tang, Z., Wang, G., Liu, S., Li, B.: Linker: Learning long short-term associations for robust visual tracking. IEEE Transactions on Multimedia 26, 6228–6237 (2024). https://doi.org/10.1109/TMM.2023.3347644

[69] [69]

Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10448–10457 (2021)

[70] [70]

Yan, B., Zhang, X., Wang, D., Lu, H., Yang, X.: Alpha-refine: Boosting tracking performance by precise bounding box estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5289–5298 (June 2021)

[71] [71]

Yang, C.Y., Huang, H.W., Chai, W., Jiang, Z., Hwang, J.N.: Samurai: Motion-aware memory for training-free visual object tracking with sam 2. IEEE Transactions on Image Processing 35, 970–982 (2026). https://doi.org/10.1109/TIP.2026.3651835

[72] [72]

Yang, Z., Wei, Y., Yang, Y.: Associating objects with transformers for video object segmentation. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 2491– Curran Associates, Inc. (2021), https://proceedings.neurips.cc/paper_files/paper/2021/file/147702db07145348245dc5a2f2fe5683-Paper.pdf

[74] [75]

Yang, Z., Yang, Y.: Decoupling features in hierarchical propagation for video object segmentation. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 36324–36336. Curran Associates, Inc. (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/...

[75] [76]

Ye, B., Chang, H., Ma, B., Shan, S., Chen, X.: Joint feature learning and relation modeling for tracking: A one-stream framework. In: European conference on computer vision. pp. 341–357. Springer (2022)

[76] [77]

Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W.: Ocean: Object-aware anchor-free tracking. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 771–787. Springer International Publishing, Cham (2020)

[77] [78]

Zheng, Y., Zhong, B., Liang, Q., Mo, Z., Zhang, S., Li, X.: Odtrack: Online dense temporal token learning for visual tracking. Proceedings of the AAAI Conference on Artificial Intelligence 38(7), 7588–7596 (Mar 2024). https://doi.org/10.1609/aaai.v38i7.28591, https://ojs.aaai.org/index.php/AAAI/article/view/28591

[78] [79]

Zhou, J., Pang, Z., Wang, Y.X.: Rmem: Restricted memory banks improve video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18602–18611 (June 2024)

Appendix

A Additional Ablation Study
B Memory-Write Diagnostic Analysis