Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking
Pith reviewed 2026-05-22 06:15 UTC · model grok-4.3
The pith
Adapting SAM 2 with explicit motion prediction, semantic shift detection, and geometric constraints creates a tracker that handles nonlinear motion and distractors more reliably than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAMOSA adapts SAM 2 to complex visual object tracking by explicitly leveraging motion, geometry, and semantic cues. A lightweight nonlinear motion predictor models target dynamics to guide mask selection and memory filtering. Semantic cues detect target shifts and aid recovery from tracking failures, while geometric cues act as structural constraints for improved stability. This combination bridges the implicit video priors of SAM 2 with explicit tracking needs, yielding consistent outperformance over state-of-the-art SAM 2-based methods on general benchmarks, stronger generalization than supervised VOT approaches, and notable gains on anti-UAV datasets that feature complex nonlinear motion.
What carries the argument
Lightweight nonlinear motion predictor that models target dynamics to guide mask selection and memory filtering, augmented by semantic shift detection and geometric structural constraints.
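To make the idea of motion-guided mask selection concrete, the following is a minimal sketch and not the paper's implementation. It assumes a constant-acceleration extrapolation as a stand-in for the nonlinear predictor (whose architecture is not given here), and the names `extrapolate_center`, `box_from_center`, `select_mask`, and `motion_weight` are hypothetical. Each candidate mask is reduced to a bounding box with a confidence score, and the candidate that best balances confidence with agreement to the predicted box is chosen.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def extrapolate_center(centers: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Predict the next center from the last three, assuming constant acceleration.

    Hypothetical stand-in for the nonlinear motion predictor; needs >= 3 centers.
    """
    (x0, y0), (x1, y1), (x2, y2) = centers[-3:]
    # p3 = p2 + velocity + acceleration = 3*p2 - 3*p1 + p0
    return (3 * x2 - 3 * x1 + x0, 3 * y2 - 3 * y1 + y0)


def box_from_center(center: Tuple[float, float], width: float, height: float) -> Box:
    """Build an axis-aligned box of the given size around a center point."""
    cx, cy = center
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_mask(
    candidates: List[Tuple[Box, float]],
    predicted: Box,
    motion_weight: float = 0.5,  # assumed weighting, not taken from the paper
) -> int:
    """Return the index of the candidate with the best combined score."""
    scores = [
        (1 - motion_weight) * conf + motion_weight * iou(box, predicted)
        for box, conf in candidates
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage: a distractor has higher confidence, but the second candidate
# sits where the motion model expects the target to be.
centers = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
predicted = box_from_center(extrapolate_center(centers), 4.0, 4.0)
candidates = [((0.0, 0.0, 4.0, 4.0), 0.9), ((7.0, 7.0, 11.0, 11.0), 0.8)]
best = select_mask(candidates, predicted)
```

With `motion_weight=0` the selection falls back to confidence alone, which is one simple way to see how much the motion cue changes the outcome.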
If this is right
- SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks.
- It demonstrates stronger generalization than supervised VOT methods across unseen objects and scenarios.
- It achieves substantial performance gains on anti-UAV datasets that typify complex nonlinear motion.
- The framework bridges implicit video understanding priors with explicit tracking-oriented modeling for more stable results under occlusion and distractors.
Where Pith is reading between the lines
- The same adaptation pattern of adding lightweight motion and consistency modules could apply to other video foundation models for tasks like action detection or video segmentation.
- Because the motion predictor is described as lightweight, the approach may support real-time use on edge devices such as drones without heavy compute overhead.
- If semantic shift detection proves robust, it could reduce reliance on frequent manual reinitialization in long-term tracking deployments.
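Semantic shift detection of the kind discussed above can be illustrated with a cosine-similarity check between a reference embedding of the target and the embedding of the current mask region. This is a sketch under stated assumptions: the feature vectors, the function names, and the `threshold` value are hypothetical, and the paper's actual detection rule may differ.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def target_shifted(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.5,  # assumed value for illustration only
) -> bool:
    """Flag a possible target shift when similarity drops below the threshold."""
    return cosine_similarity(reference, current) < threshold


# Usage with toy embeddings: the second frame has drifted to a different object.
reference = [1.0, 0.0]
same_object = [2.0, 0.0]
other_object = [0.0, 1.0]
flags = [target_shifted(reference, same_object), target_shifted(reference, other_object)]
```

A flagged frame would be a natural point to trigger recovery instead of writing the frame into memory, which is how a check like this could reduce manual reinitialization.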
Load-bearing premise
The lightweight nonlinear motion predictor combined with semantic shift detection and geometric constraints will reliably guide mask selection and memory filtering without introducing new failure modes or requiring extensive per-scenario tuning.
What would settle it
On anti-UAV or similar nonlinear motion test sets, if versions of SAMOSA without the motion predictor, semantic detection, or geometric constraints match or exceed the full model's accuracy, the claim that these adaptations are essential would be challenged.
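The ablation test described here reduces to a simple comparison once per-variant scores exist. The sketch below is illustrative: the function name, the `tolerance` parameter, and the variant names are hypothetical, and the numbers in the usage example are placeholders, not results from the paper.

```python
from typing import Dict, List


def ablation_challenges_claim(
    full_score: float,
    ablated_scores: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the ablated variants that match or exceed the full model.

    A non-empty result would challenge the claim that every module is essential.
    """
    return [
        name
        for name, score in ablated_scores.items()
        if score >= full_score - tolerance
    ]


# Usage with placeholder scores (not measured values).
challengers = ablation_challenges_claim(
    70,
    {"no_motion": 65, "no_semantic": 70, "no_geometry": 72},
)
```

In practice the comparison should account for run-to-run variance, so `tolerance` would be set from the spread observed across seeds.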
Original abstract
Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SAMOSA, a framework adapting SAM 2 for visual object tracking in complex scenarios. It introduces a lightweight nonlinear motion predictor to model target dynamics and guide mask selection plus memory filtering, semantic cues to detect shifts and recover from failures, and geometric cues as structural constraints. The central claim is that this explicit integration of motion, geometry, and semantics bridges SAM 2's implicit video priors with tracking needs, yielding consistent outperformance over SAM 2-based methods on general benchmarks, stronger generalization than supervised VOT approaches, and substantial gains on anti-UAV datasets typifying nonlinear motion.
Significance. If the empirical results hold under rigorous validation, the work could meaningfully advance integration of large-scale pretrained vision models with explicit dynamics modeling for robust tracking. The code release aids reproducibility. Strengths include the focus on generalization to unseen objects and challenging conditions without task-specific retraining, though the significance hinges on demonstrating that added modules do not introduce instability.
major comments (2)
- [Method (motion predictor and cue integration)] The central claim depends on the lightweight nonlinear motion predictor reliably steering SAM 2 mask selection and memory filtering without new failure modes under strong nonlinearity or distractors. The manuscript describes the predictor as data-guided but provides insufficient detail on its architecture, training procedure, interaction with the memory bank, or ablation isolating its contribution versus semantic/geometric modules; this leaves the assumption that it acts as a stable constraint untested in the reported experiments.
- [Experiments (anti-UAV and general benchmarks)] Table or figure reporting anti-UAV results: the substantial gains are asserted but without statistical significance testing, variance across runs, or direct comparison to the strongest SAM 2 baselines under identical conditions, the evidence supporting outperformance on nonlinear scenarios remains moderate and does not yet fully substantiate the generalization claim over supervised VOT methods.
minor comments (2)
- [Abstract] The abstract states 'extensive experiments' but could include one or two key quantitative metrics (e.g., success rate deltas) to better preview the gains.
- [Method] Notation for cue integration weights and memory filtering thresholds should be defined explicitly in the method section to improve clarity for readers reproducing the framework.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and commit to revisions that improve clarity and rigor without altering the core contributions of SAMOSA.
- Referee: [Method (motion predictor and cue integration)] The central claim depends on the lightweight nonlinear motion predictor reliably steering SAM 2 mask selection and memory filtering without new failure modes under strong nonlinearity or distractors. The manuscript describes the predictor as data-guided but provides insufficient detail on its architecture, training procedure, interaction with the memory bank, or ablation isolating its contribution versus semantic/geometric modules; this leaves the assumption that it acts as a stable constraint untested in the reported experiments.
  Authors: We agree that additional technical detail on the nonlinear motion predictor would strengthen the manuscript and better substantiate its role as a stable constraint. In the revised version we will expand the method section to specify the predictor's architecture (including layer types, input features from SAM 2 embeddings, and output parameterization), the training procedure (datasets, loss functions, and optimization details), and its precise interaction with the memory bank for mask selection and filtering. We will also add a dedicated ablation that isolates the motion predictor's contribution from the semantic and geometric modules, with qualitative analysis of failure modes under strong nonlinearity and distractors. These changes directly address the concern while preserving the lightweight design.
  revision: yes
- Referee: [Experiments (anti-UAV and general benchmarks)] Table or figure reporting anti-UAV results: the substantial gains are asserted but without statistical significance testing, variance across runs, or direct comparison to the strongest SAM 2 baselines under identical conditions, the evidence supporting outperformance on nonlinear scenarios remains moderate and does not yet fully substantiate the generalization claim over supervised VOT methods.
  Authors: We acknowledge that the current experimental presentation would benefit from greater statistical rigor. In the revision we will augment the anti-UAV and general benchmark results with statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values), report standard deviations or variance across multiple runs with different seeds where computationally feasible, and ensure all SAM 2 baselines are re-evaluated under identical conditions and hyper-parameters for direct comparison. These additions will provide stronger quantitative support for the claimed gains on nonlinear motion scenarios and the generalization advantage over supervised VOT methods.
  revision: yes
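As an illustration of the paired significance testing mentioned in this exchange, the sketch below computes an exact two-sided Wilcoxon signed-rank test using only the Python standard library. It is a minimal example under stated assumptions: it enumerates all sign assignments, so it is only practical for a small number of paired sequences, and the scores in the usage example are placeholders, not results from the paper. The function name is hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple


def wilcoxon_signed_rank_exact(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive averaged ranks. Cost grows as 2**n.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = sum(ranks)
    observed = min(w_plus, total - w_plus)

    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Usage with placeholder per-sequence scores (not measured values).
tracker_a = [5, 6, 7, 8, 9, 10]
tracker_b = [1, 1, 1, 1, 1, 1]
w_plus, p_value = wilcoxon_signed_rank_exact(tracker_a, tracker_b)
```

For larger benchmarks a library routine with a normal approximation would be the usual choice; the exact enumeration is used here only to keep the example self-contained.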
Circularity Check
No significant circularity: empirical integration of motion, semantic, and geometric modules with SAM 2
The paper presents SAMOSA as a practical adaptation framework that combines a lightweight nonlinear motion predictor, semantic shift detection, and geometric constraints to guide SAM 2's mask selection and memory filtering. No equations, derivations, or parameter-fitting steps are described that reduce by construction to the inputs or to self-citations. The central claims rest on experimental benchmark results rather than any mathematical structure that loops back to fitted values or prior author work. This is a standard empirical method paper whose load-bearing elements are externally falsifiable via the reported tracking performance on general and anti-UAV datasets.
Axiom & Free-Parameter Ledger
free parameters (1)
- cue integration weights and motion predictor hyperparameters
axioms (1)
- domain assumption: SAM 2 learns strong video understanding priors from large-scale pretraining that can be adapted for tracking
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: lightweight nonlinear motion predictor ... higher-order Markov model ... mask selection ... Error Detection–Recovery Module (EDRM) ... Target-Aware Memory Bank (TAMB)
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: Jcost, geometric score S_g, motion score S_m, semantic cosine similarity
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Authors
Deyi Zhu received the B.S. degree from the Department of Automation, Tsinghua University. He is currently pursuing the Ph.D. degree with Tsinghua Shenzhen International Graduate School, Tsinghua University. His current research interests include computer vision and embodied intelligence.
Yuji Wang received the B.S. degree in Electric and Electronic Engineering from the University of Electronic Science and Technology of China (UESTC). He is currently a second-year master student with the Shenzhen International Graduate School, Tsinghua University, supervised by Prof. Yansong Tang. His research interests focus on computer vision, including vision-language models, tool-calling, multimodal learning, image/video segmentation and tracking. He has published papers in top conferences such a...