
arxiv: 2606.24449 · v2 · pith:LL2XGNTFnew · submitted 2026-06-23 · 💻 cs.CV

SENTRY: SAM2-Enhanced Neighbor-Aware and Temporally Reasoned Memory for Visual Tracking

Pith reviewed 2026-06-26 00:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual object tracking · SAM2 · memory update · temporal consistency · cycle-consistent matching · zero-shot tracking · plug-and-play module · drift reduction

The pith

Replacing confidence-only memory writes with neighbor-aware temporal consistency checks stabilizes SAM2-based visual trackers without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies confidence-only mask selection as the main source of drift in SAM2 trackers during occlusion, rapid motion, and distractors. SENTRY inserts a training-free step before each memory write that generates multiple segmentation hypotheses per frame, forms short tracklets, and validates them through neighbor-aware cycle-consistent matching against recent trajectories. This plug-and-play replacement of the write rule yields gains when added to five existing SAM2-based trackers. A sympathetic reader would care because the change requires no retraining or architectural changes, yet produces gains across nine benchmarks while the SAM2-L version still runs at 32.8 FPS on an A100.

Core claim

SENTRY is a refine-before-write module that aggregates diverse segmentation hypotheses, backtracks them into short tracklets, and uses neighbor-aware cycle-consistent matching to enforce short-horizon temporal and geometric consistency before committing any mask to memory, replacing confidence-driven writes in unmodified SAM2 architectures and yielding consistent gains across nine benchmarks with new zero-shot state-of-the-art results on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi.

What carries the argument

The SENTRY refine-before-write module that validates memory updates via neighbor-aware cycle-consistent matching on short tracklets formed from multiple per-frame segmentation hypotheses.
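
To make the write-time check concrete, here is a minimal sketch of a forward-backward (cycle) test with neighbors in the matching pool. It is not the paper's implementation: it assumes axis-aligned boxes instead of masks, a single previous frame instead of a short tracklet, and greedy best-IoU matching in place of whatever bipartite solver the authors use. The names `iou` and `validate_write` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def validate_write(target_box, neighbor_boxes, candidates):
    """Return the index of a cycle-consistent candidate, or None to skip the write.

    target_box:     the target's box in the previous frame
    neighbor_boxes: boxes of nearby objects in the previous frame
    candidates:     hypothesis boxes in the current frame
    """
    if not candidates:
        return None
    # Forward step: previous target box -> best-overlapping current candidate.
    fwd = max(range(len(candidates)), key=lambda j: iou(target_box, candidates[j]))
    # Backward step: that candidate -> best-overlapping previous object.
    # Index 0 is the target; the neighbors compete for the match.
    previous = [target_box, *neighbor_boxes]
    bwd = max(range(len(previous)), key=lambda i: iou(candidates[fwd], previous[i]))
    # Accept only if the cycle closes on the target and there is real overlap.
    if bwd == 0 and iou(target_box, candidates[fwd]) > 0:
        return fwd
    return None
```

With a target at (10, 10, 50, 50), a neighbor at (60, 10, 100, 50), and candidates [(58, 10, 98, 50), (14, 10, 54, 50)], the function selects the second candidate regardless of which one a decoder scored higher. If the only candidate overlaps a neighbor more than the target, the cycle does not close and the write is skipped.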

If this is right

  • Consistent gains appear when SENTRY is added to five strong SAM2 baselines across nine benchmarks.
  • New zero-shot state-of-the-art results are reached on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi.
  • The SAM2-L version runs at 32.8 FPS on an A100, and SENTRY adds only about 0.4-0.6 GB of VRAM across compatible hosts.
  • The first unified all-scale evaluation of SAM2-based trackers is provided.
  • Enforcing temporal validity at write time stabilizes memory-augmented tracking without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same write-time validation pattern could apply to memory mechanisms in other video models that rely on stored features or masks.
  • Testing longer tracklet horizons or additional geometric constraints would show whether the current short-horizon choice is optimal or merely sufficient.
  • Similar consistency checks might reduce drift in related tasks such as video object segmentation or multi-object tracking.
  • Many memory-augmented systems may benefit more from stricter write rules than from later correction stages.

Load-bearing premise

Short-horizon temporal consistency measured by neighbor-aware cycle-consistent matching is a sufficient proxy for identifying correct segmentation masks under occlusion, rapid motion, and distractors.
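
A toy example shows why this premise carries weight. The sketch below uses a deliberately simplified consistency score (mean box IoU against recently written boxes), which is an assumption made for illustration and not SENTRY's scoring rule; all coordinates are made up. If recent writes already follow a distractor, a purely short-horizon score prefers the distractor over the true target.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def consistency(candidate, recent_writes):
    """Mean IoU between a candidate box and the boxes recently written to memory."""
    return sum(iou(candidate, w) for w in recent_writes) / len(recent_writes)


# Hypothetical scenario: the last three writes already track a distractor
# that drifts to the right, while the true target stayed near x = 10.
recent_writes = [(100, 10, 140, 50), (104, 10, 144, 50), (108, 10, 148, 50)]
true_target = (10, 10, 50, 50)
distractor = (112, 10, 152, 50)

scores = {
    "true_target": consistency(true_target, recent_writes),
    "distractor": consistency(distractor, recent_writes),
}
# The distractor is smooth relative to the contaminated history, so it wins.
preferred = max(scores, key=scores.get)
```

Whether the neighbor-aware and cycle-consistent components of the actual method avoid this outcome is exactly what a score comparison between correct and drifted-but-coherent hypotheses would reveal.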

What would settle it

Integrating SENTRY into the five evaluated baselines and measuring no consistent improvement on the nine benchmarks would show that the temporal validation step does not stabilize tracking.

Figures

Figures reproduced from arXiv: 2606.24449 by Hasan AlMarzouqi, Mohamad Alansari, Muzammal Naseer, Naoufel Werghi, Sajid Javed, Yonathan Michael.

Figure 1: SENTRY performs favorably against SAM2-based variants across major tracking benchmarks. We denote SENTRY applied to SAM2, SAMURAI, and DAM4SAM as SENTRY-S2, SENTRY-SR, and SENTRY-D4S. (a) AUC per benchmark; (b) average AUC across [22, 23, 47, 60].
Figure 2: Qualitative comparison under occlusion, abrupt motion, and distractor interference. We visualize decoder attention to highlight spatial focus during tracking. (a) SAM2 [51] loses target identity after occlusion, (b) SAMURAI [71] drifts under rapid motion, and (c) DAM4SAM [58] misidentifies distractors due to heuristic filtering. The last row shows SENTRY maintaining accurate localization and consistent ma…
Figure 3: Overview of the SENTRY framework. Per-frame candidates combine Automatic Mask Generation (AMG) proposals (A–B) and decoder hypotheses (D) into a joint set (C). Each candidate is evaluated through cycle-consistent, neighbor-aware bipartite matching in trajectory space (E–F). The most consistent mask is written to memory (G–H) following the baseline update schedule.
Figure 4: AUC of attributes on (a) LaSOT [23], (b) LaSOT_ext [22], and (c) TNL2K [60].
Figure 5: Acc-Rob plot on (a) VOT20 [32], (b) VOT22 [31], (c) VOTS24 [33], and (d) DiDi [58] for SAM2 and SENTRY variants. The Q is given at each label.
Figure 6: Qualitative comparison of baselines and SENTRY variants: (1) SAM2 vs. SENTRY-S2, (2) SAMURAI vs. SENTRY-SR, (3) DAM4SAM vs. SENTRY-D4S. (4) Failure case: all methods lose the target after a long out-of-view period (100 frames) with a tiny target.
Figure 7: Our proposed SENTRY performs favorably against SAM2-based variants across different model scales on the LaSOT [23] benchmark.
Original abstract

We revisit the memory update mechanism in SAM2-based visual object tracking and identify confidence-only mask selection as the dominant cause of drift under occlusion, rapid motion, and distractors. We introduce SENTRY, a training-free, plug-and-play, refine-before-write module that validates each memory update for short-horizon temporal consistency before committing it. SENTRY aggregates diverse segmentation hypotheses per frame, backtracks them into short tracklets, and uses neighbor-aware cycle-consistent matching against recent trajectories to favor temporally and geometrically consistent masks. It leaves the base architecture untouched, replacing confidence-driven writes with consistency-validated ones. For fair evaluation, we re-evaluate major open-source SAM2-based trackers across all available scales and datasets, filling gaps in prior reports. Integrated into five strong baselines, SENTRY delivers consistent gains across nine benchmarks, achieving new zero-shot SOTA on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi. Despite these checks, the SAM2-L version runs at 32.8 FPS on an A100, and across compatible hosts adds only about 0.4-0.6 GB VRAM. Our results provide the first unified all-scale evaluation of SAM2-based trackers and show that enforcing temporal validity at write time stabilizes memory-augmented tracking without retraining. Project page: https://hamadya.github.io/SENTRY/page/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that confidence-only mask selection in SAM2-based trackers causes drift under occlusion, rapid motion, and distractors. SENTRY is a training-free, plug-and-play module that aggregates segmentation hypotheses, forms short tracklets, and applies neighbor-aware cycle-consistent matching to validate temporal and geometric consistency before memory writes. Replacing confidence-driven updates with these validated writes in five baselines yields consistent gains across nine benchmarks and new zero-shot SOTA on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi. The reported overhead is modest: the SAM2-L version runs at 32.8 FPS on an A100, with about 0.4-0.6 GB of added VRAM across compatible hosts. The paper also provides the first unified all-scale evaluation of SAM2-based trackers.

Significance. If the empirical results hold, the work demonstrates that enforcing short-horizon temporal validity at write time can stabilize memory-augmented tracking without retraining or architectural changes, offering a lightweight, generalizable improvement for foundation-model trackers. The fair re-evaluation of open-source baselines across scales and the reporting of runtime/VRAM metrics are concrete strengths that aid reproducibility and comparison in the field.

major comments (3)
  1. [§3] §3 (SENTRY mechanism): The central claim that neighbor-aware cycle-consistent matching reliably selects correct masks rests on the untested assumption that short-horizon consistency is a sufficient proxy under occlusion/rapid motion/distractors; no ablation, failure-case analysis, or quantitative comparison of scores for true vs. drifted but temporally coherent hypotheses is provided to substantiate this load-bearing step.
  2. [§4] Experimental evaluation (throughout §4 and tables): Gains and new SOTA claims are reported after re-evaluating baselines, but the manuscript does not specify the exact dataset splits, versions, or statistical significance tests used; without these, the reported improvements cannot be independently verified and the cross-benchmark consistency claim is difficult to assess.
  3. [§3.2] §3.2 (cycle-consistent matching): The aggregation of hypotheses into tracklets and scoring procedure is described at a high level, but lacks explicit equations or pseudocode for the neighbor-aware component; this makes it impossible to determine whether the method can assign high consistency scores when all hypotheses share a common error pattern (e.g., aligned distractor motion).
minor comments (2)
  1. [Abstract] Abstract and §4: The phrase 'new zero-shot SOTA' should be accompanied by explicit comparison tables showing the previous best scores and the exact margins achieved.
  2. [§4] Figure captions and §4: Several result tables lack error bars or run-to-run variance, which would help contextualize the reported gains.
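
Major comment 2 and minor comment 2 both ask how reliable the reported margins are. One generic way to check this, sketched below, is a paired sign-flip permutation test on per-sequence scores from a baseline and its SENTRY variant. This is an assumption about how such a check could be run, not something the paper reports; the function name `sign_flip_pvalue` and its parameters are hypothetical.

```python
import random


def sign_flip_pvalue(baseline, variant, n_resamples=10000, seed=0):
    """Two-sided paired sign-flip test on the mean per-sequence score gain."""
    if len(baseline) != len(variant) or not baseline:
        raise ValueError("need two equally long, non-empty score lists")
    diffs = [v - b for b, v in zip(baseline, variant)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each paired difference is equally
        # likely to be positive or negative, so flip signs at random.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never 0.
    return (hits + 1) / (n_resamples + 1)
```

The inputs would be per-sequence scores (for example, per-video AUC) for the same sequences under both trackers. A small p-value suggests the mean gain is unlikely under random sign assignment; it says nothing about whether the gain matters in practice.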

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment below with clarifications and indicate the revisions planned for the manuscript.

point-by-point responses
  1. Referee: [§3] §3 (SENTRY mechanism): The central claim that neighbor-aware cycle-consistent matching reliably selects correct masks rests on the untested assumption that short-horizon consistency is a sufficient proxy under occlusion/rapid motion/distractors; no ablation, failure-case analysis, or quantitative comparison of scores for true vs. drifted but temporally coherent hypotheses is provided to substantiate this load-bearing step.

    Authors: We acknowledge that the manuscript does not contain a dedicated quantitative comparison of consistency scores between correct and drifted-but-coherent hypotheses, nor dedicated failure-case analysis on this specific point. The cross-benchmark gains provide supporting evidence, but to directly address the concern we will add an ablation study and selected failure-case visualizations in the revised version. revision: yes

  2. Referee: [§4] Experimental evaluation (throughout §4 and tables): Gains and new SOTA claims are reported after re-evaluating baselines, but the manuscript does not specify the exact dataset splits, versions, or statistical significance tests used; without these, the reported improvements cannot be independently verified and the cross-benchmark consistency claim is difficult to assess.

    Authors: All evaluations followed the official benchmark splits and dataset versions released by the respective organizers. We will explicitly document these versions and splits in the revised manuscript. Statistical significance testing is uncommon in the tracking literature; we will add a note on result consistency across the nine benchmarks but do not plan to introduce new statistical tests unless required. revision: partial

  3. Referee: [§3.2] §3.2 (cycle-consistent matching): The aggregation of hypotheses into tracklets and scoring procedure is described at a high level, but lacks explicit equations or pseudocode for the neighbor-aware component; this makes it impossible to determine whether the method can assign high consistency scores when all hypotheses share a common error pattern (e.g., aligned distractor motion).

    Authors: We agree that the neighbor-aware component would benefit from a more formal description. The revised manuscript will include explicit equations and pseudocode for the full cycle-consistent matching procedure, together with a brief discussion of robustness to shared error patterns such as aligned distractor motion. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic module evaluated on external benchmarks

full rationale

The paper introduces SENTRY as a training-free plug-and-play module that replaces confidence-based memory writes with neighbor-aware cycle-consistent matching for short-horizon temporal validation. All reported gains are empirical results from integrating the module into existing baselines and measuring performance on independent external benchmarks (LaSOT, GOT-10k, VOT20/22, etc.). No equations, fitted parameters, or self-citation chains are present that would reduce any claimed result to an input by construction; the central contribution is an algorithmic filter whose validity is tested outside the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on a domain assumption about the reliability of short-horizon consistency rather than new mathematical axioms or fitted parameters.

axioms (1)
  • domain assumption: Short-horizon temporal consistency via neighbor-aware cycle-consistent matching reliably identifies correct masks under occlusion and distractors.
    This premise underpins the entire refine-before-write logic described in the abstract.



Reference graph

Works this paper leans on

78 extracted references · 16 canonical work pages

  1. [1] Alansari, M., Abdul Hay, O., Alansari, S., Javed, S., Shoufan, A., Zweiri, Y., Werghi, N.: Drone-person tracking in uniform appearance crowd: A new dataset. Scientific Data 11(1), 15 (2024)

  2. [2] Alansari, M., Javed, S., Ganapathi, I.I., Alansari, S., Naseer, M.: Cldtracker: A comprehensive language description for visual tracking. Information Fusion 124, 103374 (2025). https://doi.org/10.1016/j.inffus.2025.103374, https://www.sciencedirect.com/science/article/pii/S1566253525004476

  3. [3] Asanomi, T., Nishimura, K., Bise, R.: Multi-frame attention with feature-level warping for drone crowd tracking. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1664–1673 (2023). https://doi.org/10.1109/WACV56688.2023.00171

  4. [4] Bhat, G., Lawin, F.J., Danelljan, M., Robinson, A., Felsberg, M., Van Gool, L., Timofte, R.: Learning what to learn for video object segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 777–794. Springer International Publishing, Cham (2020)

  5. [5] Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-nms – improving object detection with one line of code. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

  6. [6] Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., Soatto, S.: Memot: Multi-object tracking with memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8090–8100 (June 2022)

  7. [7] Cai, W., Liu, Q., Wang, Y.: Hiptrack: Visual tracking with historical prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19258–19267 (June 2024)

  8. [8] Cai, W., Liu, Q., Wang, Y.: Spmtrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16871–16881 (June 2025)

  9. [9] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Coll-Vinent, D.S., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., HAZRA, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng... In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=r35clVtGzw

  10. [10] Chen, R., Sun, G., Li, Y., Qin, J., Benini, L.: Him2sam: Enhancing sam2 with hierarchical motion estimation and memory optimization towards long-term tracking. In: Kittler, J., Xiong, H., Yang, J., Chen, X., Lu, J., Lin, W., Yu, J., Zheng, W. (eds.) Pattern Recognition and Computer Vision. pp. 276–291. Springer Nature Singapore, Singapore (2026)

  11. [11] Chen, X., Peng, H., Wang, D., Lu, H., Hu, H.: Seqtrack: Sequence to sequence learning for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14572–14581 (June 2023)

  12. [12] Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8126–8135 (2021)

  13. [13] Chen, Y., Xu, J., Yu, J., Wang, Q., Yoo, B., Han, J.J.: Afod: Adaptive focused discriminative segmentation tracker. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. pp. 666–682. Springer International Publishing, Cham (2020)

  14. [14] Chen, Y.H., Wang, C.Y., Yang, C.Y., Chang, H.S., Lin, Y.L., Chuang, Y.Y., Liao, H.Y.M.: Neighbortrack: Single object tracking by bipartite matching with neighbor tracklets and its applications to sports. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 5139–5148 (2023). https://doi.org/10.1109/CVPRW59228.2023.00542

  15. [15] Cheng, H.K., Oh, S.W., Price, B., Lee, J.Y., Schwing, A.: Putting the object back into video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3151–3161 (June 2024)

  16. [16] Cheng, H.K., Schwing, A.G.: Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 640–658. Springer Nature Switzerland, Cham (2022)

  17. [17] Cheng, H.K., Tai, Y.W., Tang, C.K.: Rethinking space-time networks with improved memory coverage for efficient video object segmentation. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 11781–11794. Curran Associates, Inc. (2021), https://proceedings.neurips....

  18. [18] Cui, Y., Jiang, C., Wang, L., Wu, G.: Mixformer: End-to-end tracking with iterative mixed attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13608–13618 (2022)

  19. [19] Cui, Y., Jiang, C., Wu, G., Wang, L.: Mixformer: End-to-end tracking with iterative mixed attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(6), 4129–4146 (2024). https://doi.org/10.1109/TPAMI.2024.3349519

  20. [20] Ding, H., Liu, C., He, S., Jiang, X., Torr, P.H., Bai, S.: Mose: A new dataset for video object segmentation in complex scenes. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 20167–20177 (2023). https://doi.org/10.1109/ICCV51070.2023.01850

  21. [21] Ding, S., Qian, R., Dong, X., Zhang, P., Zang, Y., Cao, Y., Guo, Y., Lin, D., Wang, J.: Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. arXiv preprint arXiv:2410.16268 (2024)

  22. [22] Fan, H., Bai, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Harshit, Huang, M., Liu, J., et al.: Lasot: A high-quality large-scale single object tracking benchmark. International Journal of Computer Vision 129(2), 439–461 (2021)

  23. [23] Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Ling, H.: Lasot: A high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

  24. [24] Feng, X., Li, X., Hu, S., Zhang, D., Wu, M., Zhang, J., Chen, X., Huang, K.: Memvlt: Vision-language tracking with adaptive memory-based prompts. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 14903–14933. Curran Associates, Inc. (2024), https:/...

  25. [25] Fu, Z., Liu, Q., Fu, Z., Wang, Y.: Stmtrack: Template-free visual tracking with space-time memory networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13774–13783 (June 2021)

  26. [26] Hong, L., Chen, W., Liu, Z., Zhang, W., Guo, P., Chen, Z., Zhang, W.: Lvos: A benchmark for long-term video object segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13480–13492 (October 2023)

  27. [27] Hong, L., Liu, Z., Chen, W., Tan, C., Feng, Y., Zhou, X., Guo, P., Li, J., Chen, Z., Gao, S., Zhang, W., Zhang, W.: Lvos: A benchmark for large-scale long-term video object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 48(1), 946–961 (2026). https://doi.org/10.1109/TPAMI.2025.3611020

  28. [28] Huang, L., Zhao, X., Huang, K.: Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(5), 1562–1577 (2021)

  29. [29] Huang, Y., Li, X., Zhou, Z., Wang, Y., He, Z., Yang, M.H.: Rtracker: Recoverable tracking via pn tree structured memory. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19038–19047 (2024). https://doi.org/10.1109/CVPR52733.2024.01801

  30. [30] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October 2023)

  31. [31] Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Kämäräinen, J.K., Chang, H.J., Danelljan, M., Zajc, L.Č., Lukežič, A., Drbohlav, O., Björklund, J., Zhang, Y., Zhang, Z., Yan, S., Yang, W., Cai, D., Mayer, C., Fernández, G., Ben, K., Bhat, G., Chang, H., Chen, G., Chen, J., Chen, S., Chen, X., Chen, X., Chen, X., Chen, Y., Chen, Y.H.,...

  32. [32] Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Kämäräinen, J.K., Danelljan, M., Zajc, L.Č., Lukežič, A., Drbohlav, O., He, L., Zhang, Y., Yan, S., Yang, J., Fernández, G., Hauptmann, A., Memarmoghadam, A., García-Martín, Á., Robinson, A., Varfolomieiev, A., Gebrehiwot, A.H., Uzun, B., Yan, B., Li, B., Qian, C., Tsai, C.Y., Micheloni... In: Bartoli, A., Fusiello, A...

  33. [33] Kristan, M., Matas, J., Tokmakov, P., Felsberg, M., Zajc, L.Č., Lukežič, A., Tran, K.T., Vu, X.S., Björklund, J., Chang, H.J., Fernández, G., Attari, M., Chan, A., Chen, L., Chen, X., Collins, J., Cui, Y., Devarapu, G.S.M., Du, Y., Fan, H., Fan, W.C., Feng, Z., Gao, M., Gorthi, R.K.S., Goyal, R., Han, J., Hatuwal, B., He, Z., Hu, X., Huang, X., Huang, Y.,... In: Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T...

  34. [34] Kristan, M., Matas, J., Danelljan, M., Felsberg, M., Chang, H.J., Čehovin Zajc, L., Lukežič, A., Drbohlav, O., Zhang, Z., Tran, K.T., Vu, X.S., Björklund, J., Mayer, C., Zhang, Y., Ke, L., Zhao, J., Fernández, G., Al-Shakarji, N., An, D., Arens, M., Becker, S., Bhat, G., Bullinger, S., Chan, A.B., Chang, S., Chen, H., Chen, X., Chen, Y., Chen, Z., Cheng, ... In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

  35. [35] Li, C., Yang, T., Zhu, S., Chen, C., Guan, S.: Density map guided object detection in aerial images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (June 2020)

  36. [36] Li, X., Zhong, B., Liang, Q., Mo, Z., Nong, J., Song, S.: Dynamic updates for language adaptation in visual-language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19165–19174 (June 2025)

  37. [37] Li, X., Miao, D., He, Z., Wang, Y., Lu, H., Yang, M.H.: Learning spatial-semantic features for robust video object segmentation. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=EM93t94zEi

  38. [38] Liang, S., Bai, Y., Gong, Y., Wei, X.: Autoregressive sequential pretraining for visual tracking. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7254–7264 (2025). https://doi.org/10.1109/CVPR52734.2025.00680

  39. [39] Lin, L., Fan, H., Zhang, Z., Wang, Y., Xu, Y., Ling, H.: Tracking meets lora: Faster training, larger model, stronger performance. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 300–318. Springer Nature Switzerland, Cham (2025)

  40. [40] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 38–55. Springer Nature Switzerland,...

  41. [41] Liu, X., Zhou, L., Zhou, Z., Chen, J., He, Z.: Mambavlt: Time-evolving multimodal state space model for vision-language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8731–8741 (June 2025)

  42. [42] Lukežič, A., Matas, J., Kristan, M.: A discriminative single-shot segmentation network for visual object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(12), 9742–9755 (2022). https://doi.org/10.1109/TPAMI.2021.3137933

  43. [43] Ma, Y., Tang, Y., Yang, W., Zhang, T., Zhang, J., Kang, M.: Unifying visual and vision-language tracking via contrastive learning. Proceedings of the AAAI Conference on Artificial Intelligence 38(5), 4107–4116 (Mar 2024). https://doi.org/10.1609/aaai.v38i5.28205, https://ojs.aaai.org/index.php/AAAI/article/view/28205

  44. [44] Ma, Z., Wang, L., Zhang, H., Lu, W., Yin, J.: Rpt: Learning point set representation for siamese visual tracking. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. pp. 653–665. Springer International Publishing, Cham (2020)

  45. [45]

    Mayer, C., Danelljan, M., Paudel, D.P., Van Gool, L.: Learning target candidate association to keep track of what not to track. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13444–13454 (October 2021)

  46. [46]

    Meethal, A., Granger, E., Pedersoli, M.: Cascaded zoom-in detector for high resolution aerial images. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 2046–2055 (2023). https://doi.org/10.1109/CVPRW59228.2023.00198

  47. [47]

    Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European Conference on Computer Vision (ECCV) (September 2018)

  48. [48]

    Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5(1), 32–38 (1957). https://doi.org/10.1137/0105003

  49. [49]

    Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)

  50. [50]

    Qin, H., Xu, T., Li, T., Chen, Z., Feng, T., Li, J.: Must: The first dataset and unified framework for multispectral uav single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16882–16891 (June 2025)

  51. [51]

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollar, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos. In: The Thirteenth International Conference on Learning Representations (2025)

  52. [52]

    Ren, W., Kang, D., Tang, Y., Chan, A.B.: Fusing crowd density maps and visual object trackers for people tracking in crowd scenes. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5353–5362 (2018). https://doi.org/10.1109/CVPR.2018.00561

  53. [53]

    Ryali, C., Hu, Y.T., Bolya, D., Wei, C., Fan, H., Huang, P.Y., Aggarwal, V., Chowdhury, A., Poursaeed, O., Hoffman, J., Malik, J., Li, Y., Feichtenhofer, C.: Hiera: A hierarchical vision transformer without the bells-and-whistles. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th Internationa...

  54. [54]

    Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 618–626 (2017). https://doi.org/10.1109/ICCV.2017.74

  55. [55]

    Seong, H., Hyun, J., Kim, E.: Kernelized memory network for video object segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 629–645. Springer International Publishing, Cham (2020)

  56. [56]

    Shao, Y., He, S., Ye, Q., Feng, Y., Luo, W., Chen, J.: Context-aware integration of language and visual references for natural language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19208–19217 (June 2024)

  57. [57]

    Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20993–21002 (June 2022)

  58. [58]

    Videnovic, J., Lukezic, A., Kristan, M.: A distractor-aware memory for visual object tracking with sam2. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 24255–24264 (June 2025)

  59. [59]

    Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.C.: Feelvos: Fast end-to-end embedding learning for video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

  60. [60]

    Wang, X., Shu, X., Zhang, Z., Jiang, B., Wang, Y., Tian, Y., Wu, F.: Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13763–13773 (2021)

  61. [61]

    Wen, L., Du, D., Zhu, P., Hu, Q., Wang, Q., Bo, L., Lyu, S.: Detection, tracking, and counting meets drones in crowds: A benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7812–7821 (June 2021)

  62. [62]

    Wu, Y., Wang, X., Yang, X., Liu, M., Zeng, D., Ye, H., Li, S.: Learning occlusion-robust vision transformers for real-time uav tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 17103–17113 (June 2025)

  63. [63]

    Xie, F., Wang, Z., Ma, C.: Diffusiontrack: Point set diffusion model for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19113–19124 (June 2024)

  64. [64]

    Xie, J., Zhong, B., Mo, Z., Zhang, S., Shi, L., Song, S., Ji, R.: Autoregressive queries for adaptive tracking with spatio-temporal transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19300–19309 (June 2024)

  65. [65]

    Xiong, Y., Zhou, C., Xiang, X., Wu, L., Zhu, C., Liu, Z., Suri, S., Varadarajan, B., Akula, R., Iandola, F., Krishnamoorthi, R., Soran, B., Chandra, V.: Efficient track anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11513–11524 (October 2025)

  66. [66]

    Xu, Q., Zhu, L., Liu, C., Lin, G., Long, C., Li, Z., Zhao, R.: Samite: Position prompted sam2 with calibrated memory for visual object tracking. arXiv preprint arXiv:2507.21732 (2025)

  67. [67]

    Xue, C., Zhong, B., Liang, Q., Zheng, Y., Li, N., Xue, Y., Song, S.: Similarity-guided layer-adaptive vision transformer for uav tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6730–6740 (June 2025)

  68. [68]

    Xun, Z., Di, S., Gao, Y., Tang, Z., Wang, G., Liu, S., Li, B.: Linker: Learning long short-term associations for robust visual tracking. IEEE Transactions on Multimedia 26, 6228–6237 (2024). https://doi.org/10.1109/TMM.2023.3347644

  69. [69]

    Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10448–10457 (2021)

  70. [70]

    Yan, B., Zhang, X., Wang, D., Lu, H., Yang, X.: Alpha-refine: Boosting tracking performance by precise bounding box estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5289–5298 (June 2021)

  71. [71]

    Yang, C.Y., Huang, H.W., Chai, W., Jiang, Z., Hwang, J.N.: Samurai: Motion-aware memory for training-free visual object tracking with sam 2. IEEE Transactions on Image Processing 35, 970–982 (2026). https://doi.org/10.1109/TIP.2026.3651835

  72. [72]

    Yang, Z., Wei, Y., Yang, Y.: Associating objects with transformers for video object segmentation. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 2491–

  73. [73]

    Curran Associates, Inc. (2021), https://proceedings.neurips.cc/paper_files/paper/2021/file/147702db07145348245dc5a2f2fe5683-Paper.pdf

  74. [75]

    Yang, Z., Yang, Y.: Decoupling features in hierarchical propagation for video object segmentation. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 36324–36336. Curran Associates, Inc. (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/...

  75. [76]

    Ye, B., Chang, H., Ma, B., Shan, S., Chen, X.: Joint feature learning and relation modeling for tracking: A one-stream framework. In: European conference on computer vision. pp. 341–357. Springer (2022)

  76. [77]

    Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W.: Ocean: Object-aware anchor-free tracking. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 771–787. Springer International Publishing, Cham (2020)

  77. [78]

    Zheng, Y., Zhong, B., Liang, Q., Mo, Z., Zhang, S., Li, X.: Odtrack: Online dense temporal token learning for visual tracking. Proceedings of the AAAI Conference on Artificial Intelligence 38(7), 7588–7596 (Mar 2024). https://doi.org/10.1609/aaai.v38i7.28591, https://ojs.aaai.org/index.php/AAAI/article/view/28591

  78. [79]

    Zhou, J., Pang, Z., Wang, Y.X.: Rmem: Restricted memory banks improve video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18602–18611 (June 2024)

Appendix

A Additional Ablation Study
B Memory-Write Diagnostic Analysis