SENTRY: SAM2-Enhanced Neighbor-Aware and Temporally Reasoned Memory for Visual Tracking
Pith reviewed 2026-06-26 00:25 UTC · model grok-4.3
The pith
Replacing confidence-only memory writes with neighbor-aware temporal consistency checks stabilizes SAM2-based visual trackers without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SENTRY is a refine-before-write module for unmodified SAM2 architectures. It aggregates diverse segmentation hypotheses, backtracks them into short tracklets, and uses neighbor-aware cycle-consistent matching to enforce short-horizon temporal and geometric consistency before committing any mask to memory. Replacing confidence-driven writes in this way yields consistent gains across nine benchmarks and new zero-shot state-of-the-art results on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi.
What carries the argument
The SENTRY refine-before-write module that validates memory updates via neighbor-aware cycle-consistent matching on short tracklets formed from multiple per-frame segmentation hypotheses.
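To make the contrast between confidence-driven and consistency-validated writes concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Hypothesis` type, the use of bounding-box IoU as the consistency measure, and the `min_score` threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


@dataclass
class Hypothesis:
    box: Box           # bounding box of a candidate mask
    confidence: float  # score reported by the segmenter


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pick_by_confidence(hyps: List[Hypothesis]) -> Hypothesis:
    """Baseline behaviour: write whichever mask the segmenter trusts most."""
    return max(hyps, key=lambda h: h.confidence)


def pick_by_consistency(hyps: List[Hypothesis], recent: List[Box],
                        min_score: float = 0.3) -> Optional[Hypothesis]:
    """Illustrative write gate: prefer the hypothesis that best agrees with
    the recent trajectory, and skip the write (None) if nothing agrees."""
    def score(h: Hypothesis) -> float:
        return sum(iou(h.box, b) for b in recent) / len(recent)

    best = max(hyps, key=score)
    return best if score(best) >= min_score else None


# A confident distractor far from the recent trajectory vs. a less
# confident candidate that continues it.
recent = [(10.0, 10.0, 50.0, 50.0), (12.0, 10.0, 52.0, 50.0)]
target = Hypothesis((14.0, 10.0, 54.0, 50.0), 0.80)
distractor = Hypothesis((200.0, 10.0, 240.0, 50.0), 0.95)
```

In this toy case the confidence rule selects the distractor, while the consistency gate selects the candidate that continues the recent trajectory and declines to write at all when only the distractor is offered.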
If this is right
- Consistent gains appear when SENTRY is added to five strong SAM2 baselines across nine benchmarks.
- New zero-shot state-of-the-art results are reached on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi.
- The SAM2-L version runs at 32.8 FPS on an A100, and SENTRY adds only about 0.4-0.6 GB of VRAM across compatible hosts.
- The first unified all-scale evaluation of SAM2-based trackers is provided.
- Enforcing temporal validity at write time stabilizes memory-augmented tracking without retraining.
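The throughput figure in the list above can be read as a per-frame time budget. The short Python calculation below converts FPS to milliseconds per frame; the 30 FPS comparison rate is an assumption added here for context, not a number from the paper.

```python
def frame_latency_ms(fps: float) -> float:
    """Average time spent per frame, in milliseconds, at a given throughput."""
    return 1000.0 / fps


sentry_ms = frame_latency_ms(32.8)    # throughput quoted for the SAM2-L version
video_ms = frame_latency_ms(30.0)     # assumed 30 FPS video rate, for comparison
headroom_ms = video_ms - sentry_ms    # slack left before falling behind 30 FPS video

print(f"{sentry_ms:.1f} ms per frame, {headroom_ms:.1f} ms headroom at 30 FPS")
```

At 32.8 FPS the whole pipeline, including the consistency checks, spends roughly 30.5 ms per frame on average.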
Where Pith is reading between the lines
- The same write-time validation pattern could apply to memory mechanisms in other video models that rely on stored features or masks.
- Testing longer tracklet horizons or additional geometric constraints would show whether the current short-horizon choice is optimal or merely sufficient.
- Similar consistency checks might reduce drift in related tasks such as video object segmentation or multi-object tracking.
- Many memory-augmented systems may benefit more from stricter write rules than from later correction stages.
Load-bearing premise
Short-horizon temporal consistency measured by neighbor-aware cycle-consistent matching is a sufficient proxy for identifying correct segmentation masks under occlusion, rapid motion, and distractors.
What would settle it
Integrating SENTRY into the five evaluated baselines and measuring no consistent improvement on the nine benchmarks would show that the temporal validation step does not stabilize tracking.
Original abstract
We revisit the memory update mechanism in SAM2-based visual object tracking and identify confidence-only mask selection as the dominant cause of drift under occlusion, rapid motion, and distractors. We introduce SENTRY, a training-free, plug-and-play, refine-before-write module that validates each memory update for short-horizon temporal consistency before committing it. SENTRY aggregates diverse segmentation hypotheses per frame, backtracks them into short tracklets, and uses neighbor-aware cycle-consistent matching against recent trajectories to favor temporally and geometrically consistent masks. It leaves the base architecture untouched, replacing confidence-driven writes with consistency-validated ones. For fair evaluation, we re-evaluate major open-source SAM2-based trackers across all available scales and datasets, filling gaps in prior reports. Integrated into five strong baselines, SENTRY delivers consistent gains across nine benchmarks, achieving new zero-shot SOTA on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi. Despite these checks, the SAM2-L version runs at 32.8 FPS on an A100, and across compatible hosts adds only about 0.4-0.6 GB VRAM. Our results provide the first unified all-scale evaluation of SAM2-based trackers and show that enforcing temporal validity at write time stabilizes memory-augmented tracking without retraining. Project page: https://hamadya.github.io/SENTRY/page/
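The abstract does not spell out the matching rule, so the following Python sketch shows only one common reading of "cycle-consistent matching": a forward-backward (mutual best match) check between candidate tracklets and recent trajectories. The function name, the precomputed similarity matrix, and the interpretation of columns as "target plus neighbors" are assumptions made for illustration.

```python
from typing import List, Tuple


def cycle_consistent_pairs(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """sim[i][j] is a similarity between candidate tracklet i and recent
    trajectory j (for example, the target and its neighbors).

    A pair (i, j) is kept only if j is the best match for i (forward pass)
    and i is the best match for j (backward pass), so the match survives a
    round trip.
    """
    if not sim or not sim[0]:
        return []
    n_rows, n_cols = len(sim), len(sim[0])
    forward = [max(range(n_cols), key=lambda j: sim[i][j]) for i in range(n_rows)]
    backward = [max(range(n_rows), key=lambda i: sim[i][j]) for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(forward) if backward[j] == i]


# Three candidate tracklets, two recent trajectories
# (column 0: tracked target, column 1: a neighboring object).
sim = [
    [0.9, 0.2],  # candidate 0 resembles the target
    [0.8, 0.1],  # candidate 1 also resembles the target, but less so
    [0.3, 0.7],  # candidate 2 resembles the neighbor
]
pairs = cycle_consistent_pairs(sim)
```

Under these assumptions candidate 1 is discarded because the target trajectory prefers candidate 0, and candidate 2 is explained by the neighbor rather than the target, which is the kind of distractor identification the "neighbor-aware" wording suggests.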
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that confidence-only mask selection in SAM2-based trackers causes drift under occlusion, rapid motion, and distractors. SENTRY is a training-free, plug-and-play module that aggregates segmentation hypotheses, forms short tracklets, and applies neighbor-aware cycle-consistent matching to validate temporal and geometric consistency before memory writes. Integrating these validated writes into five baselines in place of confidence-driven updates yields consistent gains across nine benchmarks and new zero-shot SOTA on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi. The reported overhead is small (32.8 FPS for the SAM2-L version on an A100 and about 0.4-0.6 GB of added VRAM), and the paper provides the first unified all-scale evaluation of SAM2-based trackers.
Significance. If the empirical results hold, the work demonstrates that enforcing short-horizon temporal validity at write time can stabilize memory-augmented tracking without retraining or architectural changes, offering a lightweight, generalizable improvement for foundation-model trackers. The fair re-evaluation of open-source baselines across scales and the reporting of runtime/VRAM metrics are concrete strengths that aid reproducibility and comparison in the field.
major comments (3)
- [§3] §3 (SENTRY mechanism): The central claim that neighbor-aware cycle-consistent matching reliably selects correct masks rests on the untested assumption that short-horizon consistency is a sufficient proxy under occlusion/rapid motion/distractors; no ablation, failure-case analysis, or quantitative comparison of scores for true vs. drifted but temporally coherent hypotheses is provided to substantiate this load-bearing step.
- [§4] Experimental evaluation (throughout §4 and tables): Gains and new SOTA claims are reported after re-evaluating baselines, but the manuscript does not specify the exact dataset splits, versions, or statistical significance tests used; without these, the reported improvements cannot be independently verified and the cross-benchmark consistency claim is difficult to assess.
- [§3.2] §3.2 (cycle-consistent matching): The aggregation of hypotheses into tracklets and scoring procedure is described at a high level, but lacks explicit equations or pseudocode for the neighbor-aware component; this makes it impossible to determine whether the method can assign high consistency scores when all hypotheses share a common error pattern (e.g., aligned distractor motion).
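The shared-error concern in the major comments can be illustrated with a toy case. The Python sketch below uses a deliberately simple temporal score (mean frame-to-frame displacement of a tracklet's center); this score is an assumption for illustration and is not claimed to be SENTRY's scoring function, which the report says is not specified in enough detail to check.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mean_step(centers: List[Point]) -> float:
    """Mean frame-to-frame displacement of a tracklet (lower means smoother)."""
    steps = [
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        for (x1, y1), (x2, y2) in zip(centers, centers[1:])
    ]
    return sum(steps) / len(steps)


# The true target and a distractor moving in parallel, 50 px apart.
target = [(10.0 + 2.0 * t, 20.0) for t in range(5)]
distractor = [(60.0 + 2.0 * t, 20.0) for t in range(5)]

# Both tracklets are equally smooth, so this score cannot separate them.
same_score = mean_step(target) == mean_step(distractor)
```

Any score that looks only at a tracklet's own temporal coherence rates the two identically here, so the tie has to be broken by something else, such as agreement with what is already in memory or with the identities of neighboring trajectories.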
minor comments (2)
- [Abstract] Abstract and §4: The phrase 'new zero-shot SOTA' should be accompanied by explicit comparison tables showing the previous best scores and the exact margins achieved.
- [§4] Figure captions and §4: Several result tables lack error bars or run-to-run variance, which would help contextualize the reported gains.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review. We address each major comment below with clarifications and indicate the revisions planned for the manuscript.
Point-by-point responses
- Referee: [§3] §3 (SENTRY mechanism): The central claim that neighbor-aware cycle-consistent matching reliably selects correct masks rests on the untested assumption that short-horizon consistency is a sufficient proxy under occlusion/rapid motion/distractors; no ablation, failure-case analysis, or quantitative comparison of scores for true vs. drifted but temporally coherent hypotheses is provided to substantiate this load-bearing step.
  Authors: We acknowledge that the manuscript does not contain a dedicated quantitative comparison of consistency scores between correct and drifted-but-coherent hypotheses, nor dedicated failure-case analysis on this specific point. The cross-benchmark gains provide supporting evidence, but to directly address the concern we will add an ablation study and selected failure-case visualizations in the revised version.
  Revision: yes
- Referee: [§4] Experimental evaluation (throughout §4 and tables): Gains and new SOTA claims are reported after re-evaluating baselines, but the manuscript does not specify the exact dataset splits, versions, or statistical significance tests used; without these, the reported improvements cannot be independently verified and the cross-benchmark consistency claim is difficult to assess.
  Authors: All evaluations followed the official benchmark splits and dataset versions released by the respective organizers. We will explicitly document these versions and splits in the revised manuscript. Statistical significance testing is uncommon in the tracking literature; we will add a note on result consistency across the nine benchmarks but do not plan to introduce new statistical tests unless required.
  Revision: partial
- Referee: [§3.2] §3.2 (cycle-consistent matching): The aggregation of hypotheses into tracklets and scoring procedure is described at a high level, but lacks explicit equations or pseudocode for the neighbor-aware component; this makes it impossible to determine whether the method can assign high consistency scores when all hypotheses share a common error pattern (e.g., aligned distractor motion).
  Authors: We agree that the neighbor-aware component would benefit from a more formal description. The revised manuscript will include explicit equations and pseudocode for the full cycle-consistent matching procedure, together with a brief discussion of robustness to shared error patterns such as aligned distractor motion.
  Revision: yes
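On the statistical-significance exchange above, one lightweight option is an exact sign test over benchmarks. The Python sketch below is an illustration only: it treats each benchmark as an independent trial and ignores the size of each gain, both of which are simplifying assumptions, and the win counts passed in are hypothetical inputs rather than figures taken from the paper's tables.

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """Two-sided exact sign test p-value under H0: gain and loss equally likely.

    wins: number of benchmarks where the modified tracker beats its baseline
    n:    number of benchmarks compared (ties excluded beforehand)
    """
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical scenario: a gain on every one of nine benchmarks.
p_all_nine = sign_test_p(9, 9)
```

If a gain really were observed on all nine benchmarks for a given host, the two-sided p-value under these assumptions would be 2/512, or about 0.004; with seven wins out of nine it would rise to roughly 0.18, which shows how quickly the evidence weakens when gains are not uniform.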
Circularity Check
No circularity: algorithmic module evaluated on external benchmarks
The paper introduces SENTRY as a training-free plug-and-play module that replaces confidence-based memory writes with neighbor-aware cycle-consistent matching for short-horizon temporal validation. All reported gains are empirical results from integrating the module into existing baselines and measuring performance on independent external benchmarks (LaSOT, GOT-10k, VOT20/22, etc.). No equations, fitted parameters, or self-citation chains are present that would reduce any claimed result to an input by construction; the central contribution is an algorithmic filter whose validity is tested outside the method itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Short-horizon temporal consistency via neighbor-aware cycle-consistent matching reliably identifies correct masks under occlusion and distractors.
Reference graph
Works this paper leans on
-
[1]
Scientific Data11(1), 15 (2024)
Alansari, M., Abdul Hay, O., Alansari, S., Javed, S., Shoufan, A., Zweiri, Y., Werghi, N.: Drone-person tracking in uniform appearance crowd: A new dataset. Scientific Data11(1), 15 (2024)
2024
-
[2]
Information Fu- sion124, 103374 (2025).https : / / doi
Alansari, M., Javed, S., Ganapathi, I.I., Alansari, S., Naseer, M.: Cldtracker: A comprehensive language description for visual tracking. Information Fu- sion124, 103374 (2025).https : / / doi . org / https : / / doi . org / 10 . 1016 / j . inffus.2025.103374,https://www.sciencedirect.com/science/article/pii/ S1566253525004476
arXiv 2025
-
[3]
Asanomi, T., Nishimura, K., Bise, R.: Multi-frame attention with feature-level warping for drone crowd tracking. In: 2023 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV). pp. 1664–1673 (2023).https://doi.org/ 10.1109/WACV56688.2023.00171
-
[4]
In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M
Bhat, G., Lawin, F.J., Danelljan, M., Robinson, A., Felsberg, M., Van Gool, L., Timofte, R.: Learning what to learn for video object segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 777–794. Springer International Publishing, Cham (2020)
2020
-
[5]
In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-nms – improving object de- tection with one line of code. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
2017
-
[6]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., Soatto, S.: Memot: Multi-object tracking with memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8090–8100 (June 2022)
2022
-
[7]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Cai, W., Liu, Q., Wang, Y.: Hiptrack: Visual tracking with historical prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19258–19267 (June 2024)
2024
-
[8]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Cai, W., Liu, Q., Wang, Y.: Spmtrack: Spatio-temporal parameter-efficient fine- tuning with mixture of experts for scalable visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16871–16881 (June 2025)
2025
-
[9]
In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=r35clVtGzw 16 M
Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Coll-Vinent, D.S., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., HAZRA, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng...
2026
-
[10]
In: Kittler, J., Xiong, H., Yang, J., Chen, X., Lu, J., Lin, W., Yu, J., Zheng, W
Chen, R., Sun, G., Li, Y., Qin, J., Benini, L.: Him2sam: Enhancing sam2 with hier- archical motion estimation and memory optimization towards long-term tracking. In: Kittler, J., Xiong, H., Yang, J., Chen, X., Lu, J., Lin, W., Yu, J., Zheng, W. (eds.) Pattern Recognition and Computer Vision. pp. 276–291. Springer Nature Singapore, Singapore (2026)
2026
-
[11]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chen, X., Peng, H., Wang, D., Lu, H., Hu, H.: Seqtrack: Sequence to sequence learning for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14572–14581 (June 2023)
2023
-
[12]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 8126–8135 (2021)
2021
-
[13]
In: Bartoli, A., Fusiello, A
Chen, Y., Xu, J., Yu, J., Wang, Q., Yoo, B., Han, J.J.: Afod: Adaptive focused discriminative segmentation tracker. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. pp. 666–682. Springer International Publishing, Cham (2020)
2020
-
[14]
Uncrtaints: Uncertainty quantification for cloud removal in optical satellite time series,
Chen, Y.H., Wang, C.Y., Yang, C.Y., Chang, H.S., Lin, Y.L., Chuang, Y.Y., Liao, H.Y.M.: Neighbortrack: Single object tracking by bipartite matching with neigh- bor tracklets and its applications to sports. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 5139–5148 (2023).https://doi.org/10.1109/CVPRW59228.2023.00542
-
[15]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Cheng, H.K., Oh, S.W., Price, B., Lee, J.Y., Schwing, A.: Putting the object back into video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3151–3161 (June 2024)
2024
-
[16]
In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T
Cheng, H.K., Schwing, A.G.: Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 640–658. Springer Nature Switzerland, Cham (2022)
2022
-
[17]
In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W
Cheng, H.K., Tai, Y.W., Tang, C.K.: Rethinking space-time networks with im- proved memory coverage for efficient video object segmentation. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neu- ral Information Processing Systems. vol. 34, pp. 11781–11794. Curran Associates, Inc. (2021),https://proceedings.neurips....
2021
-
[18]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Cui, Y., Jiang, C., Wang, L., Wu, G.: Mixformer: End-to-end tracking with itera- tive mixed attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13608–13618 (2022)
2022
-
[19]
Cui,Y.,Jiang,C.,Wu,G.,Wang,L.:Mixformer:End-to-endtrackingwithiterative mixed attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(6), 4129–4146 (2024).https://doi.org/10.1109/TPAMI.2024.3349519
-
[20]
Ding, H., Liu, C., He, S., Jiang, X., Torr, P.H., Bai, S.: Mose: A new dataset for video object segmentation in complex scenes. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 20167–20177 (2023).https://doi. org/10.1109/ICCV51070.2023.01850
-
[21]
arXiv preprint arXiv:2410.16268 (2024)
Ding, S., Qian, R., Dong, X., Zhang, P., Zang, Y., Cao, Y., Guo, Y., Lin, D., Wang, J.: Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. arXiv preprint arXiv:2410.16268 (2024)
arXiv 2024
-
[22]
International Journal of Computer Vision129(2), 439–461 (2021) SENTRY 17
Fan, H., Bai, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Harshit, Huang, M., Liu, J., et al.: Lasot: A high-quality large-scale single object tracking benchmark. International Journal of Computer Vision129(2), 439–461 (2021) SENTRY 17
2021
-
[23]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Ling, H.: Lasot: A high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
2019
-
[24]
In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C
Feng, X., Li, X., Hu, S., Zhang, D., Wu, M., Zhang, J., Chen, X., Huang, K.: Memvlt: Vision-language tracking with adaptive memory-based prompts. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 14903–14933. Curran Associates, Inc. (2024),https:/...
2024
-
[25]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Fu, Z., Liu, Q., Fu, Z., Wang, Y.: Stmtrack: Template-free visual tracking with space-time memory networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13774–13783 (June 2021)
2021
-
[26]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Hong, L., Chen, W., Liu, Z., Zhang, W., Guo, P., Chen, Z., Zhang, W.: Lvos: A benchmark for long-term video object segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13480– 13492 (October 2023)
2023
-
[27]
Hong, L., Liu, Z., Chen, W., Tan, C., Feng, Y., Zhou, X., Guo, P., Li, J., Chen, Z., Gao, S., Zhang, W., Zhang, W.: Lvos: A benchmark for large-scale long-term video object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelli- gence48(1), 946–961 (2026).https://doi.org/10.1109/TPAMI.2025.3611020
-
[28]
IEEE Transactions on Pattern Analysis and Machine Intelligence43(5), 1562–1577 (2021)
Huang, L., Zhao, X., Huang, K.: Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence43(5), 1562–1577 (2021)
2021
-
[29]
Huang, Y., Li, X., Zhou, Z., Wang, Y., He, Z., Yang, M.H.: Rtracker: Recov- erable tracking via pn tree structured memory. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19038–19047 (2024). https://doi.org/10.1109/CVPR52733.2024.01801
-
[30]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October 2023)
2023
-
[31]
Alansari, Y
Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Kämäräinen, J.K., Chang, H.J., Danelljan, M., Zajc, L.Č., Lukežič, A., Drbohlav, O., Björklund, J., Zhang, Y., Zhang, Z., Yan, S., Yang, W., Cai, D., Mayer, C., Fernández, G., Ben, K., Bhat, G., Chang, H., Chen, G., Chen, J., Chen, S., Chen, X., Chen, X., Chen, X., Chen, Y., Chen, Y.H.,...
2022
-
[32]
In: Bartoli, A., Fusiello, A
Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Kämäräinen, J.K., Danelljan, M., Zajc, L.Č., Lukežič, A., Drbohlav, O., He, L., Zhang, Y., Yan, S., Yang, J., Fernández, G., Hauptmann, A., Memarmoghadam, A., García-Martín, Á., Robinson, A., Varfolomieiev, A., Gebrehiwot, A.H., Uzun, B., Yan, B., Li, B., Qian, C., Tsai, C.Y., Micheloni...
2020
-
[33]
In: Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T
Kristan, M., Matas, J., Tokmakov, P., Felsberg, M., Zajc, L.Č., Lukežič, A., Tran, K.T., Vu, X.S., Björklund, J., Chang, H.J., Fernández, G., Attari, M., Chan, A., Chen, L., Chen, X., Collins, J., Cui, Y., Devarapu, G.S.M., Du, Y., Fan, H., Fan, W.C., Feng, Z., Gao, M., Gorthi, R.K.S., Goyal, R., Han, J., Hatuwal, B., He, Z., Hu, X., Huang, X., Huang, Y.,...
2024
-
[34]
In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Kristan, M., Matas, J., Danelljan, M., Felsberg, M., Chang, H.J., Čehovin Zajc, L., Lukežič, A., Drbohlav, O., Zhang, Z., Tran, K.T., Vu, X.S., Björklund, J., Mayer, C., Zhang, Y., Ke, L., Zhao, J., Fernández, G., Al-Shakarji, N., An, D., Arens, M., Becker, S., Bhat, G., Bullinger, S., Chan, A.B., Chang, S., Chen, H., Chen, X., Chen, Y., Chen, Z., Cheng, ...
-
[35]
In: Proceedingsof the IEEE/CVFConference on ComputerVision and Pattern Recognition (CVPR) Workshops (June 2020)
Li, C., Yang, T., Zhu, S., Chen, C., Guan, S.: Density map guided object detection in aerial images. In: Proceedingsof the IEEE/CVFConference on ComputerVision and Pattern Recognition (CVPR) Workshops (June 2020)
2020
-
[36]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Li, X., Zhong, B., Liang, Q., Mo, Z., Nong, J., Song, S.: Dynamic updates for language adaptation in visual-language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19165– 19174 (June 2025)
2025
-
[37]
In: The Thirteenth International Conferenceon LearningRepresentations(2025),https://openreview.net/forum? id=EM93t94zEi
Li, X., Miao, D., He, Z., Wang, Y., Lu, H., Yang, M.H.: Learning spatial-semantic features for robust video object segmentation. In: The Thirteenth International Conferenceon LearningRepresentations(2025),https://openreview.net/forum? id=EM93t94zEi
2025
-
[38]
In: 2025 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR)
Liang, S., Bai, Y., Gong, Y., Wei, X.: Autoregressive sequential pretraining for visual tracking. In: 2025 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR). pp. 7254–7264 (2025).https://doi.org/10.1109/ CVPR52734.2025.00680
arXiv 2025
-
[39]
In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G
Lin, L., Fan, H., Zhang, Z., Wang, Y., Xu, Y., Ling, H.: Tracking meets lora: Faster training, larger model, stronger performance. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 300–318. Springer Nature Switzerland, Cham (2025)
2024
-
[40]
In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 38–55. Springer Nature Switzerland,...
2024
-
[41]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Liu, X., Zhou, L., Zhou, Z., Chen, J., He, Z.: Mambavlt: Time-evolving multimodal state space model for vision-language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8731–8741 (June 2025)
2025
-
[42]
Lukežič, A., Matas, J., Kristan, M.: A discriminative single-shot segmentation net- work for visual object tracking. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence44(12), 9742–9755 (2022).https://doi.org/10.1109/TPAMI. 2021.3137933
-
[43]
Ma, Y., Tang, Y., Yang, W., Zhang, T., Zhang, J., Kang, M.: Unifying vi- sual and vision-language tracking via contrastive learning. Proceedings of the AAAI Conference on Artificial Intelligence38(5), 4107–4116 (Mar 2024).https: //doi.org/10.1609/aaai.v38i5.28205,https://ojs.aaai.org/index.php/ AAAI/article/view/28205
-
[44]
In: Bartoli, A., Fusiello, A
Ma, Z., Wang, L., Zhang, H., Lu, W., Yin, J.: Rpt: Learning point set representa- tion for siamese visual tracking. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision 20 M. Alansari, Y. Michael et al. – ECCV 2020 Workshops. pp. 653–665. Springer International Publishing, Cham (2020)
2020
-
[45]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Mayer, C., Danelljan, M., Paudel, D.P., Van Gool, L.: Learning target candidate association to keep track of what not to track. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13444–13454 (October 2021)
2021
-
[46]
Uncrtaints: Uncertainty quantification for cloud removal in optical satellite time series,
Meethal, A., Granger, E., Pedersoli, M.: Cascaded zoom-in detector for high resolution aerial images. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 2046–2055 (2023).https: //doi.org/10.1109/CVPRW59228.2023.00198
-
[47]
In: Proceedings of the European Conference on Computer Vision (ECCV) (September 2018)
Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: Trackingnet: A large- scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European Conference on Computer Vision (ECCV) (September 2018)
2018
-
[48]
Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics5(1), 32–38 (1957).https: //doi.org/10.1137/0105003,https://doi.org/10.1137/0105003
-
[49]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
2019
-
[50]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Qin, H., Xu, T., Li, T., Chen, Z., Feng, T., Li, J.: Must: The first dataset and unified framework for multispectral uav single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16882–16891 (June 2025)
2025
-
[51]
In: The Thirteenth International Conference on Learning Representations (2025)
Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollar, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos. In: The Thirteenth International Conference on Learning Representations (2025)
2025
-
[52]
In: 2018 IEEE/CVF Conference onComputerVisionandPatternRecognition.pp.5353–5362(2018).https://doi
Ren, W., Kang, D., Tang, Y., Chan, A.B.: Fusing crowd density maps and visual object trackers for people tracking in crowd scenes. In: 2018 IEEE/CVF Conference onComputerVisionandPatternRecognition.pp.5353–5362(2018).https://doi. org/10.1109/CVPR.2018.00561
-
[53]
In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J
Ryali, C., Hu, Y.T., Bolya, D., Wei, C., Fan, H., Huang, P.Y., Aggarwal, V., Chowdhury, A., Poursaeed, O., Hoffman, J., Malik, J., Li, Y., Feichtenhofer, C.: Hiera: A hierarchical vision transformer without the bells-and-whistles. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceed- ings of the 40th Internationa...
2023
-
[54]
In: 2017 IEEE International Conference on Computer Vision (ICCV)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad- cam: Visual explanations from deep networks via gradient-based localization. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 618–626 (2017).https://doi.org/10.1109/ICCV.2017.74
-
[55]
In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M
Seong, H., Hyun, J., Kim, E.: Kernelized memory network for video object segmen- tation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 629–645. Springer International Publishing, Cham (2020)
2020
-
[56]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Shao, Y., He, S., Ye, Q., Feng, Y., Luo, W., Chen, J.: Context-aware integration of language and visual references for natural language tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19208–19217 (June 2024) SENTRY 21
2024
-
[57]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Sun,P.,Cao,J.,Jiang,Y., Yuan,Z.,Bai,S., Kitani,K.,Luo,P.:Dancetrack:Multi- object tracking in uniform appearance and diverse motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20993–21002 (June 2022)
2022
-
[58]
In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)
Videnovic, J., Lukezic, A., Kristan, M.: A distractor-aware memory for visual ob- ject tracking with sam2. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 24255–24264 (June 2025)
2025
-
[59]
In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.C.: Feelvos: Fast end-to-end embedding learning for video object segmentation. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
2019
-
[60]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang, X., Shu, X., Zhang, Z., Jiang, B., Wang, Y., Tian, Y., Wu, F.: Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13763–13773 (2021)
2021
-
[61]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Wen, L., Du, D., Zhu, P., Hu, Q., Wang, Q., Bo, L., Lyu, S.: Detection, track- ing, and counting meets drones in crowds: A benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7812–7821 (June 2021)
2021
-
[62]
Wu, Y., Wang, X., Yang, X., Liu, M., Zeng, D., Ye, H., Li, S.: Learning occlusion-robust vision transformers for real-time uav tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 17103–17113 (June 2025)
-
[63]
Xie, F., Wang, Z., Ma, C.: Diffusiontrack: Point set diffusion model for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19113–19124 (June 2024)
-
[64]
Xie, J., Zhong, B., Mo, Z., Zhang, S., Shi, L., Song, S., Ji, R.: Autoregressive queries for adaptive tracking with spatio-temporal transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19300–19309 (June 2024)
-
[65]
Xiong, Y., Zhou, C., Xiang, X., Wu, L., Zhu, C., Liu, Z., Suri, S., Varadarajan, B., Akula, R., Iandola, F., Krishnamoorthi, R., Soran, B., Chandra, V.: Efficient track anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11513–11524 (October 2025)
-
[66]
Xu, Q., Zhu, L., Liu, C., Lin, G., Long, C., Li, Z., Zhao, R.: Samite: Position prompted sam2 with calibrated memory for visual object tracking. arXiv preprint arXiv:2507.21732 (2025)
-
[67]
Xue, C., Zhong, B., Liang, Q., Zheng, Y., Li, N., Xue, Y., Song, S.: Similarity-guided layer-adaptive vision transformer for uav tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6730–6740 (June 2025)
-
[68]
Xun, Z., Di, S., Gao, Y., Tang, Z., Wang, G., Liu, S., Li, B.: Linker: Learning long short-term associations for robust visual tracking. IEEE Transactions on Multimedia 26, 6228–6237 (2024). https://doi.org/10.1109/TMM.2023.3347644
-
[69]
Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10448–10457 (2021)
-
[70]
Yan, B., Zhang, X., Wang, D., Lu, H., Yang, X.: Alpha-refine: Boosting tracking performance by precise bounding box estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5289–5298 (June 2021)
-
[71]
Yang, C.Y., Huang, H.W., Chai, W., Jiang, Z., Hwang, J.N.: Samurai: Motion-aware memory for training-free visual object tracking with sam 2. IEEE Transactions on Image Processing 35, 970–982 (2026). https://doi.org/10.1109/TIP.2026.3651835
-
[72]
Yang, Z., Wei, Y., Yang, Y.: Associating objects with transformers for video object segmentation. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 2491–. Curran Associates, Inc. (2021), https://proceedings.neurips.cc/paper_files/paper/2021/file/147702db07145348245dc5a2f2fe5683-Paper.pdf
-
[75]
Yang, Z., Yang, Y.: Decoupling features in hierarchical propagation for video object segmentation. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 36324–36336. Curran Associates, Inc. (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/...
-
[76]
Ye, B., Chang, H., Ma, B., Shan, S., Chen, X.: Joint feature learning and relation modeling for tracking: A one-stream framework. In: European Conference on Computer Vision. pp. 341–357. Springer (2022)
-
[77]
Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W.: Ocean: Object-aware anchor-free tracking. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 771–787. Springer International Publishing, Cham (2020)
-
[78]
Zheng, Y., Zhong, B., Liang, Q., Mo, Z., Zhang, S., Li, X.: Odtrack: Online dense temporal token learning for visual tracking. Proceedings of the AAAI Conference on Artificial Intelligence 38(7), 7588–7596 (Mar 2024). https://doi.org/10.1609/aaai.v38i7.28591, https://ojs.aaai.org/index.php/AAAI/article/view/28591
-
[79]
Zhou, J., Pang, Z., Wang, Y.X.: Rmem: Restricted memory banks improve video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18602–18611 (June 2024)

Appendix
A Additional Ablation Study
B Memory-Write Diagnostic Analysis