arxiv: 2605.22538 · v1 · pith:IE3JCUQVnew · submitted 2026-05-21 · 💻 cs.CV

Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

Pith reviewed 2026-05-22 06:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual object tracking · SAM 2 · nonlinear motion prediction · semantic adaptation · geometric constraints · anti-UAV tracking · foundation model adaptation · video object segmentation

The pith

Adapting SAM 2 with explicit motion prediction, semantic shift detection, and geometric constraints creates a tracker that handles nonlinear motion and distractors more reliably than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that SAM 2's pretrained video understanding can be turned into an effective visual object tracker for difficult cases by adding targeted adaptations for motion dynamics, geometric structure, and semantic consistency. Traditional supervised trackers struggle with generalization to new objects and scenarios involving occlusion or nonlinear paths, while direct use of SAM 2 falls short because it lacks explicit modeling of target movement and cross-frame constraints. SAMOSA adds a lightweight nonlinear motion predictor to forecast dynamics and steer mask selection plus memory updates, employs semantic cues to identify target changes and recover from failures, and applies geometric cues as stability constraints. A reader would care because the result points toward trackers that work across unseen conditions without task-specific retraining, potentially simplifying deployment in surveillance, UAV monitoring, and similar domains.

Core claim

SAMOSA adapts SAM 2 to complex visual object tracking by explicitly leveraging motion, geometry, and semantic cues. A lightweight nonlinear motion predictor models target dynamics to guide mask selection and memory filtering. Semantic cues detect target shifts and aid recovery from tracking failures, while geometric cues act as structural constraints for improved stability. This combination bridges the implicit video priors of SAM 2 with explicit tracking needs, yielding consistent outperformance over state-of-the-art SAM 2-based methods on general benchmarks, stronger generalization than supervised VOT approaches, and notable gains on anti-UAV datasets that feature complex nonlinear motion.
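One way to picture motion-guided mask selection is as a blend of two signals: how well each candidate mask agrees with the box forecast by the motion predictor, and how confident the segmenter is in that mask. The sketch below is an illustration under stated assumptions, not the paper's scoring rule. The `Candidate` type, the `select_mask` function, and the `motion_weight` parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate mask, summarized by its bounding box and a confidence score."""
    box: tuple         # (x1, y1, x2, y2) in pixels
    confidence: float  # assumed mask-quality score in [0, 1]


def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_mask(candidates, predicted_box, motion_weight=0.5):
    """Return the index of the candidate with the highest blended score.

    motion_weight is a hypothetical mixing weight, not a value from the paper:
    0.0 trusts the segmenter's confidence only, 1.0 trusts the motion forecast only.
    """
    def score(c):
        motion_term = motion_weight * box_iou(c.box, predicted_box)
        appearance_term = (1.0 - motion_weight) * c.confidence
        return motion_term + appearance_term

    return max(range(len(candidates)), key=lambda i: score(candidates[i]))
```

With `motion_weight=0.0` the function reduces to picking the most confident mask. Raising the weight lets a slightly less confident mask win when it sits where the forecast expects the target, which is the behaviour that matters when a distractor looks like the target.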

What carries the argument

Lightweight nonlinear motion predictor that models target dynamics to guide mask selection and memory filtering, augmented by semantic shift detection and geometric structural constraints.
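The Figure 3 caption describes the memory module (TAMB) as a filtering stage followed by a top-k selection. A minimal sketch of that two-stage shape follows, assuming each stored frame carries a motion-agreement value and a confidence value. The field names, the `min_motion_iou` threshold, and the choice to rank by confidence are assumptions made for illustration, not details taken from the paper.

```python
def filter_memory(entries, min_motion_iou=0.3, k=3):
    """Two-stage memory selection: threshold filter, then top-k.

    entries: list of dicts with hypothetical keys
        'frame'      - frame index
        'motion_iou' - agreement between the stored mask and the motion forecast
        'confidence' - assumed mask-quality score
    Returns up to k surviving entries, ordered by frame index.
    """
    # Stage 1: drop frames whose mask disagreed with the motion forecast.
    kept = [e for e in entries if e["motion_iou"] >= min_motion_iou]
    # Stage 2: keep the k most confident survivors.
    top = sorted(kept, key=lambda e: e["confidence"], reverse=True)[:k]
    # Restore temporal order so downstream attention sees frames chronologically.
    return sorted(top, key=lambda e: e["frame"])
```

The filter runs before the ranking on purpose: a frame captured while the tracker had drifted onto a distractor can be very confident, so ranking alone would let it into memory.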

If this is right

  • SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks.
  • It demonstrates stronger generalization than supervised VOT methods across unseen objects and scenarios.
  • It achieves substantial performance gains on anti-UAV datasets that typify complex nonlinear motion.
  • The framework bridges implicit video understanding priors with explicit tracking-oriented modeling for more stable results under occlusion and distractors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptation pattern of adding lightweight motion and consistency modules could apply to other video foundation models for tasks like action detection or video segmentation.
  • Because the motion predictor is described as lightweight, the approach may support real-time use on edge devices such as drones without heavy compute overhead.
  • If semantic shift detection proves robust, it could reduce reliance on frequent manual reinitialization in long-term tracking deployments.

Load-bearing premise

The lightweight nonlinear motion predictor combined with semantic shift detection and geometric constraints will reliably guide mask selection and memory filtering without introducing new failure modes or requiring extensive per-scenario tuning.

What would settle it

On anti-UAV or similar nonlinear motion test sets, if versions of SAMOSA without the motion predictor, semantic detection, or geometric constraints match or exceed the full model's accuracy, the claim that these adaptations are essential would be challenged.
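The check described above is a leave-one-out ablation. It can be sketched as a small harness that lists the variants and flags any ablated variant that matches or exceeds the full model. The module labels and the `tolerance` parameter are hypothetical, and the scores would have to come from real evaluation runs; none are supplied here.

```python
MODULES = ("motion", "semantic", "geometric")  # hypothetical module labels


def leave_one_out(modules=MODULES):
    """Map each variant name to the set of modules that stay enabled."""
    variants = {"full": frozenset(modules)}
    for m in modules:
        variants[f"no_{m}"] = frozenset(modules) - {m}
    return variants


def challenged_by(scores, tolerance=0.0):
    """Names of ablated variants whose score is within `tolerance` of, or above, the full model.

    scores: dict mapping variant name to a scalar metric (higher is better).
    A non-empty result means the corresponding module was not needed on this test set.
    """
    full = scores["full"]
    return sorted(
        name for name, s in scores.items()
        if name != "full" and s >= full - tolerance
    )
```

An empty list from `challenged_by` is consistent with the claim that every module contributes. Any returned name identifies the module whose necessity the data calls into question.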

Figures

Figures reproduced from arXiv: 2605.22538 by Bingyao Yu, Deyi Zhu, Jie Zhou, Jiwen Lu, Yansong Tang, Yong Liu, Yuji Wang.

Figure 1: Performance comparison on linear and nonlinear motion scenarios
Figure 2: Examples of the roles of motion, geometry, and semantic cues in complex visual object tracking. Frames are cropped for clarity. (a) Motion cues help track small objects moving in cluttered backgrounds. (b) Geometry cues help prevent interference from similar distractors nearby. (c) Semantic cues utilize latent feature to help identify and prevent target shift errors.
Figure 3: (a) Overall pipeline of SAMOSA, which integrates the proposed MP, EDRM, and TAMB modules into the SAM 2 backbone. (b) The MP is trained independently from SAM 2 and videos. After training, it is directly plugged into SAM 2 for inference. (c) The framework of TAMB, consisting of a memory filtering stage and a top-k selection process.
Figure 4: The framework of Error Detection-Recovery Module.
Figure 5: Visual comparison of different MP backbones under complex nonlinear
Figure 6: Visual comparison results. Ground truth bounding boxes are marked in red. Masks and bounding boxes predicted by methods are marked in green.

Original abstract

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SAMOSA, a framework adapting SAM 2 for visual object tracking in complex scenarios. It introduces a lightweight nonlinear motion predictor to model target dynamics and guide mask selection plus memory filtering, semantic cues to detect shifts and recover from failures, and geometric cues as structural constraints. The central claim is that this explicit integration of motion, geometry, and semantics bridges SAM 2's implicit video priors with tracking needs, yielding consistent outperformance over SAM 2-based methods on general benchmarks, stronger generalization than supervised VOT approaches, and substantial gains on anti-UAV datasets typifying nonlinear motion.

Significance. If the empirical results hold under rigorous validation, the work could meaningfully advance integration of large-scale pretrained vision models with explicit dynamics modeling for robust tracking. The code release aids reproducibility. Strengths include the focus on generalization to unseen objects and challenging conditions without task-specific retraining, though the significance hinges on demonstrating that added modules do not introduce instability.

major comments (2)
  1. [Method (motion predictor and cue integration)] The central claim depends on the lightweight nonlinear motion predictor reliably steering SAM 2 mask selection and memory filtering without new failure modes under strong nonlinearity or distractors. The manuscript describes the predictor as data-guided but provides insufficient detail on its architecture, training procedure, interaction with the memory bank, or ablation isolating its contribution versus semantic/geometric modules; this leaves the assumption that it acts as a stable constraint untested in the reported experiments.
  2. [Experiments (anti-UAV and general benchmarks)] Table or figure reporting anti-UAV results: the substantial gains are asserted but without statistical significance testing, variance across runs, or direct comparison to the strongest SAM 2 baselines under identical conditions, the evidence supporting outperformance on nonlinear scenarios remains moderate and does not yet fully substantiate the generalization claim over supervised VOT methods.
minor comments (2)
  1. [Abstract] The abstract states 'extensive experiments' but could include one or two key quantitative metrics (e.g., success rate deltas) to better preview the gains.
  2. [Method] Notation for cue integration weights and memory filtering thresholds should be defined explicitly in the method section to improve clarity for readers reproducing the framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and commit to revisions that improve clarity and rigor without altering the core contributions of SAMOSA.

Point-by-point responses
  1. Referee: [Method (motion predictor and cue integration)] The central claim depends on the lightweight nonlinear motion predictor reliably steering SAM 2 mask selection and memory filtering without new failure modes under strong nonlinearity or distractors. The manuscript describes the predictor as data-guided but provides insufficient detail on its architecture, training procedure, interaction with the memory bank, or ablation isolating its contribution versus semantic/geometric modules; this leaves the assumption that it acts as a stable constraint untested in the reported experiments.

    Authors: We agree that additional technical detail on the nonlinear motion predictor would strengthen the manuscript and better substantiate its role as a stable constraint. In the revised version we will expand the method section to specify the predictor's architecture (including layer types, input features from SAM 2 embeddings, and output parameterization), the training procedure (datasets, loss functions, and optimization details), and its precise interaction with the memory bank for mask selection and filtering. We will also add a dedicated ablation that isolates the motion predictor's contribution from the semantic and geometric modules, with qualitative analysis of failure modes under strong nonlinearity and distractors. These changes directly address the concern while preserving the lightweight design. revision: yes

  2. Referee: [Experiments (anti-UAV and general benchmarks)] Table or figure reporting anti-UAV results: the substantial gains are asserted but without statistical significance testing, variance across runs, or direct comparison to the strongest SAM 2 baselines under identical conditions, the evidence supporting outperformance on nonlinear scenarios remains moderate and does not yet fully substantiate the generalization claim over supervised VOT methods.

    Authors: We acknowledge that the current experimental presentation would benefit from greater statistical rigor. In the revision we will augment the anti-UAV and general benchmark results with statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values), report standard deviations or variance across multiple runs with different seeds where computationally feasible, and ensure all SAM 2 baselines are re-evaluated under identical conditions and hyper-parameters for direct comparison. These additions will provide stronger quantitative support for the claimed gains on nonlinear motion scenarios and the generalization advantage over supervised VOT methods. revision: yes
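The paired comparison promised in the response above can be illustrated with a dependency-free test. The sketch below uses an exact sign-flip permutation test on per-sequence score differences. This is a deliberately swapped-in technique, not the paired t-test or Wilcoxon signed-rank test the response names. The function name and inputs are hypothetical, and the enumeration is only practical for a small number of sequences.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired scores.

    Under the null hypothesis that the two trackers are interchangeable,
    each per-sequence difference is equally likely to have either sign.
    Returns the fraction of the 2**n sign assignments whose absolute summed
    difference is at least as large as the observed one.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance so floating-point noise does not drop exact ties.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total
```

Because the test pairs scores by sequence, it requires both trackers to be evaluated on the same sequences under the same protocol, which is the identical-conditions requirement the referee raises.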

Circularity Check

0 steps flagged

No significant circularity: empirical integration of motion, semantic, and geometric modules with SAM 2

Full rationale

The paper presents SAMOSA as a practical adaptation framework that combines a lightweight nonlinear motion predictor, semantic shift detection, and geometric constraints to guide SAM 2's mask selection and memory filtering. No equations, derivations, or parameter-fitting steps are described that reduce by construction to the inputs or to self-citations. The central claims rest on experimental benchmark results rather than any mathematical structure that loops back to fitted values or prior author work. This is a standard empirical method paper whose load-bearing elements are externally falsifiable via the reported tracking performance on general and anti-UAV datasets.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that SAM 2 already encodes useful video priors that can be steered by explicit cues, plus likely free parameters for cue weighting and predictor hyperparameters that are tuned on validation data.

free parameters (1)
  • cue integration weights and motion predictor hyperparameters
    These control how motion, geometry, and semantic signals influence mask selection and memory; they are introduced to make the adaptation work and are expected to be fitted or chosen on held-out data.
axioms (1)
  • domain assumption: SAM 2 learns strong video understanding priors from large-scale pretraining that can be adapted for tracking
    Invoked in the abstract as the foundation that the proposed modules build upon.

pith-pipeline@v0.9.0 · 5811 in / 1441 out tokens · 46850 ms · 2026-05-22T06:15:14.722939+00:00 · methodology

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 1 internal anchor

  1. [1]

    Fully-convolutional siamese networks for object tracking,

    L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “Fully-convolutional siamese networks for object tracking,” inEur . Conf. Comput. Vis. Workshops, 2016, pp. 850–865

  2. [2]

    Siamrpn++: Evolution of siamese visual tracking with very deep networks,

    B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “Siamrpn++: Evolution of siamese visual tracking with very deep networks,” inIEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 4277–4286

  3. [3]

    Siam r-cnn: Visual tracking by re-detection,

    P. V oigtlaender, J. Luiten, P. H. Torr, and B. Leibe, “Siam r-cnn: Visual tracking by re-detection,” inIEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 6578–6588

  4. [4]

    Siamon: Siamese occlusion-aware network for visual tracking,

    C. Fan, H. Yu, Y . Huang, C. Shan, L. Wang, and C. Li, “Siamon: Siamese occlusion-aware network for visual tracking,”IEEE Trans. Circuit Syst. Video Technol., vol. 33, no. 1, pp. 186–199, 2023

  5. [5]

    Siamthn: Siamese target highlight network for visual tracking,

    J. Bao, K. Chen, X. Sun, L. Zhao, W. Diao, and M. Yan, “Siamthn: Siamese target highlight network for visual tracking,”IEEE Trans. Circuit Syst. Video Technol., vol. 35, no. 7, pp. 7061–7074, 2025

  6. [6]

    Transformer tracking,

    X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” inIEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 8122– 8131

  7. [7]

    Joint feature learning and relation modeling for tracking: A one-stream framework,

    B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” inEur . Conf. Comput. Vis., 2022, pp. 341–357

  8. [8]

    Autoregressive visual tracking,

    X. Wei, Y . Bai, Y . Zheng, D. Shi, and Y . Gong, “Autoregressive visual tracking,” inIEEE Conf. Comput. Vis. Pattern Recog., June 2023, pp. 9697–9706

  9. [9]

    Artrackv2: Prompting autore- gressive tracker where to look and how to describe,

    Y . Bai, Z. Zhao, Y . Gong, and X. Wei, “Artrackv2: Prompting autore- gressive tracker where to look and how to describe,” inIEEE Conf. Comput. Vis. Pattern Recog., June 2024

  10. [10]

    Bidirectional interaction of cnn and transformer feature for visual tracking,

    B. Sun, Z. Wang, S. Wang, Y . Cheng, and J. Ning, “Bidirectional interaction of cnn and transformer feature for visual tracking,”IEEE Trans. Circuit Syst. Video Technol., vol. 34, no. 8, pp. 7259–7271, 2024

  11. [11]

    Learning an adaptive and view-invariant vision transformer for real-time uav tracking,

    Y . Wu, Y . Li, M. Liu, X. Wang, X. Yang, H. Ye, D. Zeng, Q. Zhao, and S. Li, “Learning an adaptive and view-invariant vision transformer for real-time uav tracking,”IEEE Trans. Circuit Syst. Video Technol., vol. 36, no. 2, pp. 2403–2418, 2026

  12. [12]

    Got-10k: A large high-diversity benchmark for generic object tracking in the wild,

    L. Huang, X. Zhao, and K. Huang, “Got-10k: A large high-diversity benchmark for generic object tracking in the wild,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 5, pp. 1562–1577, 2021

  13. [13]

    Track- ingnet: A large-scale dataset and benchmark for object tracking in the wild,

    M. M ¨uller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem, “Track- ingnet: A large-scale dataset and benchmark for object tracking in the wild,” inEur . Conf. Comput. Vis., 2018, pp. 310–327

  14. [14]

    Lasot: A high-quality benchmark for large-scale single object tracking,

    H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y . Xu, C. Liao, and H. Ling, “Lasot: A high-quality benchmark for large-scale single object tracking,” inIEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 5369–5378

  15. [15]

    Object tracking benchmark,

    Y . Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, 2015

  16. [16]

    Online learning samples and adaptive recovery for robust rgb-t tracking,

    J. Liu, Z. Luo, and X. Xiong, “Online learning samples and adaptive recovery for robust rgb-t tracking,”IEEE Trans. Circuit Syst. Video Technol., vol. 34, no. 2, pp. 724–737, 2024

  17. [17]

    Top-down cross-modal guidance for robust rgb-t tracking,

    L. Chen, B. Zhong, Q. Liang, Y . Zheng, Z. Mo, and S. Song, “Top-down cross-modal guidance for robust rgb-t tracking,”IEEE Trans. Circuit Syst. Video Technol., vol. 34, no. 12, pp. 12 388–12 398, 2024. 12

  18. [18]

    Siamcda: Complementarity- and distractor-aware rgb-t tracking based on siamese network,

    T. Zhang, X. Liu, Q. Zhang, and J. Han, “Siamcda: Complementarity- and distractor-aware rgb-t tracking based on siamese network,”IEEE Trans. Circuit Syst. Video Technol., vol. 32, no. 3, pp. 1403–1417, 2022

  19. [19]

    Mambavt: Spatio-temporal contextual modeling for robust rgb-t tracking,

    S. Lai, C. Liu, J. Zhu, B. Kang, Y . Liu, D. Wang, and H. Lu, “Mambavt: Spatio-temporal contextual modeling for robust rgb-t tracking,”IEEE Trans. Circuit Syst. Video Technol., vol. 35, no. 9, pp. 9312–9323, 2025

  20. [20]

    Learning transferable visual models from natural language supervi- sion,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” inInt. Conf. Mach. Learn., vol. 139, 2021, pp. 8748–8763

  21. [21]

    Emerging properties in self-supervised vision transformers,

    M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” inInt. Conf. Comput. Vis., 2021, pp. 9630–9640

  22. [22]

    DINOv2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khali- dov, P. Fernandez, D. HAZIZA, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features withou...

  23. [23]

    DINOv3

    O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haz- iza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. J´egou, P. Labatut, and P. Bojanowski, “DINOv3,”arXiv:2508.10104, 2025

  24. [24]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inInt. Conf. Comput. Vis., 2023, pp. 4015–4026

  25. [25]

    Sam 2: Segment anything in images and videos,

    N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V . Alwala, N. Carion, C.-Y . Wu, R. Girshick, P. Dollar, and C. Feichtenhofer, “Sam 2: Segment anything in images and videos,” inInt. Conf. Learn. Represent., 2025

  26. [26]

    Sam 3: Segment anything with concepts,

    N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. S. Coll-Vinent, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. R ¨adle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y . Zhou, L. Momeni, R. HAZRA, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Dollar, N. R...

  27. [27]

    Available: https://openreview.net/forum?id=r35clVtGzw

    [Online]. Available: https://openreview.net/forum?id=r35clVtGzw

  28. [28]

    Anti-uav: A large-scale benchmark for vision-based uav tracking,

    N. Jiang, K. Wang, X. Peng, X. Yu, Q. Wang, J. Xing, G. Li, G. Guo, Q. Ye, J. Jiaoet al., “Anti-uav: A large-scale benchmark for vision-based uav tracking,”IEEE Trans. Image Process., vol. 25, pp. 486–500, 2021

  29. [29]

    Open-vocabulary segmentation with semantic-assisted calibration,

    Y . Liu, S. Bai, G. Li, Y . Wang, and Y . Tang, “Open-vocabulary segmentation with semantic-assisted calibration,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 3491–3500

  30. [30]

    Stepping out of similar semantic space for open-vocabulary segmentation,

    Y . Liu, S.-L. Wu, S. Bai, J. Wang, Y . Wang, and Y . Tang, “Stepping out of similar semantic space for open-vocabulary segmentation,” inInt. Conf. Comput. Vis., October 2025, pp. 22 664–22 674

  31. [31]

    Self- calibrated clip for training-free open-vocabulary segmentation,

    S. Bai, Y . Liu, Y . Han, H. Zhang, Y . Tang, J. Zhou, and J. Lu, “Self- calibrated clip for training-free open-vocabulary segmentation,”IEEE Trans. Image Process., vol. 34, pp. 8271–8284, 2025

  32. [32]

    Learning high-quality dynamic memory for video object segmentation,

    Y . Liu, R. Yu, F. Yin, X. Zhao, W. Zhao, W. Xia, J. Wang, Y . Wang, Y . Tang, and Y . Yang, “Learning high-quality dynamic memory for video object segmentation,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 5, pp. 3452–3468, 2025

  33. [33]

    Video decoupling networks for accurate, efficient, generalizable, and robust video object segmentation,

    J. Dang, H. Zheng, Y . Guo, J. Lai, B. Hu, and T.-S. Chua, “Video decoupling networks for accurate, efficient, generalizable, and robust video object segmentation,”IEEE Trans. Image Process., vol. 35, pp. 1218–1230, 2026

  34. [34]

    Region aware video object segmentation with deep motion modeling,

    B. Miao, M. Bennamoun, Y . Gao, and A. Mian, “Region aware video object segmentation with deep motion modeling,”IEEE Trans. Image Process., vol. 33, pp. 2639–2651, 2024

  35. [35]

    Lavt: Language-aware vision transformer for referring image segmentation,

    Z. Yang, J. Wang, Y . Tang, K. Chen, H. Zhao, and P. H. Torr, “Lavt: Language-aware vision transformer for referring image segmentation,” inIEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 18 155–18 165

  36. [36]

    Samu- rai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory,

    C.-Y . Yang, H.-W. Huang, W. Chai, Z. Jiang, and J.-N. Hwang, “Samu- rai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory,”arXiv:2411.11922, 2024

  37. [37]

    A distractor-aware memory for visual object tracking with SAM2,

    J. Videnovic, A. Lukezic, and M. Kristan, “A distractor-aware memory for visual object tracking with SAM2,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 24 255–24 264

  38. [38]

    Samite: Position prompted sam2 with calibrated memory for visual object tracking,

    Q. Xu, L. Zhu, C. Liu, G. Lin, C. Long, Z. Li, and R. Zhao, “Samite: Position prompted sam2 with calibrated memory for visual object tracking,”arXiv:2507.21732, 2025

  39. [39]

    Him2sam: Enhancing SAM2 with hierarchical motion estimation and memory optimization towards long-term tracking,

    R. Chen, G. Sun, Y . Li, J. Qin, and L. Benini, “Him2sam: Enhancing SAM2 with hierarchical motion estimation and memory optimization towards long-term tracking,” inPattern Recognition and Computer Vision, 2025, pp. 276–291

  40. [40]

    Camouflaged instance segmentation in- the-wild: Dataset, method, and benchmark suite,

    T.-N. Le, Y . Cao, T.-C. Nguyen, M.-Q. Le, K.-D. Nguyen, T.-T. Do, M.-T. Tran, and T. V . Nguyen, “Camouflaged instance segmentation in- the-wild: Dataset, method, and benchmark suite,”IEEE Trans. Image Process., vol. 31, pp. 287–300, 2022

  41. [41]

    Sam-pm: Enhancing video camouflaged object detection using spatio-temporal attention,

    M. N. Meeran, G. A. T, and B. P. Mantha, “Sam-pm: Enhancing video camouflaged object detection using spatio-temporal attention,” inIEEE Conf. Comput. Vis. Pattern Recog. Worksh., June 2024, pp. 1857–1866

  42. [42]

    Zoomnext: A unified collaborative pyramid network for camouflaged object detection,

    Y . Pang, X. Zhao, T.-Z. Xiang, L. Zhang, and H. Lu, “Zoomnext: A unified collaborative pyramid network for camouflaged object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 9205–9220, 2024

  43. [43]

    Sam2-love: Segment anything model 2 in language-aided audio-visual scenes,

    Y . Wang, H. Xu, Y . Liu, J. Li, and Y . Tang, “Sam2-love: Segment anything model 2 in language-aided audio-visual scenes,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 28 932–28 941

  44. [44]

    Ddavs: Disentangled audio semantics and delayed bidi- rectional alignment for audio-visual segmentation,

    J. Tian, Y . Du, H. Zhang, Y . Wang, I. N. Lee, X. Bai, T. Zhu, J. Niu, and Y . Tang, “Ddavs: Disentangled audio semantics and delayed bidi- rectional alignment for audio-visual segmentation,”arXiv:2512.20117, 2025

  45. [45]

    Contrastive conditional latent diffusion for audio-visual segmentation,

    Y . Mao, J. Zhang, M. Xiang, Y . Lv, D. Li, Y . Zhong, and Y . Dai, “Contrastive conditional latent diffusion for audio-visual segmentation,” IEEE Trans. Image Process., vol. 34, pp. 4108–4119, 2025

  46. [46]

    Actor and action modular network for text-based video segmentation,

    J. Yang, Y . Huang, K. Niu, L. Huang, Z. Ma, and L. Wang, “Actor and action modular network for text-based video segmentation,”IEEE Trans. Image Process., vol. 31, pp. 4474–4489, 2022

  47. [47]

    Semantic-assisted object clustering for multi-modal referring video segmentation,

    Y . Liu, Z. Luo, Y . Xiao, Y . Wang, S. Li, X. Li, Y . Yang, and Y . Tang, “Semantic-assisted object clustering for multi-modal referring video segmentation,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, no. 1, pp. 572–590, 2026

  48. [48]

    Language-aware vision transformer for referring segmentation,

    Z. Yang, J. Wang, X. Ye, Y . Tang, K. Chen, H. Zhao, and P. H. S. Torr, “Language-aware vision transformer for referring segmentation,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 7, pp. 5238–5255, 2025

  49. [49]

    Iterprime: zero-shot referring image segmentation with iterative grad-cam refinement and primary word emphasis,

    Y . Wang, J. Ni, Y . Liu, C. Yuan, and Y . Tang, “Iterprime: zero-shot referring image segmentation with iterative grad-cam refinement and primary word emphasis,” inAAAI, 2025, pp. 8159–8168

  50. [50]

    Vg-refiner: Towards tool-refined referring grounded reasoning via agentic reinforcement learning,

    Y . Wang, W. Liu, J. Niu, H. Zhang, and Y . Tang, “Vg-refiner: Towards tool-refined referring grounded reasoning via agentic reinforcement learning,”arXiv:2512.06373, 2025

  51. [51]

    Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree,

    S. Ding, R. Qian, X. Dong, P. Zhang, Y . Zang, Y . Cao, Y . Guo, D. Lin, and J. Wang, “Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree,” inInt. Conf. Comput. Vis., 2025, pp. 13 614–13 624

  52. [52]

    Advancing complex video object segmentation via progressive concept construction,

    Z. Zhang, S. Ding, X. Dong, S. He, J. Lin, J. Tang, Y . Zang, Y . Cao, D. Lin, and J. Wang, “Advancing complex video object segmentation via progressive concept construction,” inInt. Conf. Learn. Represent., 2026. [Online]. Available: https://openreview.net/forum?id=hDM3YphhVx

  53. [53]

    A new approach to linear filtering and prediction problems,

    R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960

  54. [54]

    Lasot: A high-quality large-scale single object tracking benchmark,

    H. Fan, H. Bai, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, Harshit, M. Huang, J. Liu, Y. Xu, C. Liao, L. Yuan, and H. Ling, “Lasot: A high-quality large-scale single object tracking benchmark,” Int. J. Comput. Vis., vol. 129, no. 2, pp. 439–461, 2021

  55. [55]

    Anti-uav410: A thermal infrared benchmark and customized scheme for tracking drones in the wild,

    B. Huang, J. Li, J. Chen, G. Wang, J. Zhao, and T. Xu, “Anti-uav410: A thermal infrared benchmark and customized scheme for tracking drones in the wild,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 5, pp. 2852–2865, 2023

  56. [56]

    Evidential detection and tracking collaboration: New problem, benchmark and algorithm for robust anti-uav system,

    X.-F. Zhu, T. Xu, J. Zhao, J.-W. Liu, K. Wang, G. Wang, J. Li, Q. Wang, L. Jin, Z. Zhu, J. Xing, and X.-J. Wu, “Evidential detection and tracking collaboration: New problem, benchmark and algorithm for robust anti-uav system,” arXiv:2306.15767, 2023

  57. [57]

    Vision-based anti-uav detection and tracking,

    J. Zhao, J. Zhang, D. Li, and D. Wang, “Vision-based anti-uav detection and tracking,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 25 323–25 334, 2022

  58. [58]

    Visual object tracking using adaptive correlation filters,

    D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking using adaptive correlation filters,” in IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 2544–2550

  59. [59]

    High-speed tracking with kernelized correlation filters,

    J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2015

  60. [60]

    Discriminative correlation filter with channel and spatial reliability,

    A. Lukežič, T. Vojíř, L. C. Zajc, J. Matas, and M. Kristan, “Discriminative correlation filter with channel and spatial reliability,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 4847–4856

  61. [61]

    Learning discriminative model prediction for tracking,

    G. Bhat, M. Danelljan, L. Van Gool, and R. Timofte, “Learning discriminative model prediction for tracking,” in Int. Conf. Comput. Vis., 2019, pp. 6181–6190

  62. [62]

    Tracking meets lora: Faster training, larger model, stronger performance,

    L. Lin, H. Fan, Z. Zhang, Y. Wang, Y. Xu, and H. Ling, “Tracking meets lora: Faster training, larger model, stronger performance,” in Eur. Conf. Comput. Vis., 2024, pp. 300–318

  63. [63]

    Odtrack: Online dense temporal token learning for visual tracking,

    Y. Zheng, B. Zhong, Q. Liang, Z. Mo, S. Zhang, and X. Li, “Odtrack: Online dense temporal token learning for visual tracking,” in AAAI, 2024, pp. 7588–7596

  64. [64]

    Hiera: A hierarchical vision transformer without the bells-and-whistles,

    C. Ryali, Y.-T. Hu, D. Bolya, C. Wei, H. Fan, P.-Y. Huang, V. Aggarwal, A. Chowdhury, O. Poursaeed, J. Hoffman et al., “Hiera: A hierarchical vision transformer without the bells-and-whistles,” in Int. Conf. Mach. Learn., 2023, pp. 29 441–29 454

  65. [65]

    Distance-iou loss: Faster and better learning for bounding box regression,

    Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-iou loss: Faster and better learning for bounding box regression,” in AAAI, vol. 34, no. 07, 2020, pp. 12 993–13 000

  66. [66]

    Seqtrack: Sequence to sequence learning for visual object tracking,

    X. Chen, H. Peng, D. Wang, H. Lu, and H. Hu, “Seqtrack: Sequence to sequence learning for visual object tracking,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 14 572–14 581

  67. [67]

    Robust object modeling for visual tracking,

    Y. Cai, J. Liu, J. Tang, and G. Wu, “Robust object modeling for visual tracking,” in Int. Conf. Comput. Vis., 2023, pp. 9555–9566

  68. [68]

    Hiptrack: Visual tracking with historical prompts,

    W. Cai, Q. Liu, and Y. Wang, “Hiptrack: Visual tracking with historical prompts,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 19 258–19 267

  69. [69]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

  70. [70]

    Generalized intersection over union: A metric and a loss for bounding box regression,

    H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 658–666

Deyi Zhu received the B.S. degree from the Department of Automation, Tsinghua University, in

He is currently pursuing the Ph.D. degree with Tsinghua Shenzhen International Graduate School, Tsinghua University. His current research interests include computer vision and embodied intelligence.

Yuji Wang received the B.S. degree in Electric and Electronic Engineering from the University of Electronic Science and Technology of China (UESTC) in

He is currently a second-year master's student with the Shenzhen International Graduate School, Tsinghua University, supervised by Prof. Yansong Tang. His research interests focus on computer vision, including vision-language models, tool-calling, multimodal learning, image/video segmentation and tracking. He has published papers in top conferences such a...