MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
Pith reviewed 2026-05-07 14:04 UTC · model grok-4.3
The pith
MarkIt converts a video into a query-conditioned marked version, overlaying instance masks and frame indices so that Vid-LLMs can output more accurate start and end times.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MarkIt transforms an input video into a query-conditioned marked video using an annotation-free query-to-mask grounding bridge. The bridge derives a compact set of canonical subject tags from the query through linguistic parsing and normalization, then maps the tags to query-conditioned instance masks with text-conditioned open-vocabulary segmentation. It further embeds lightweight semantic instance markers and a persistent frame index into each frame, converting long-range temporal reasoning into explicit visual cues that existing Vid-LLMs can exploit for more reliable start- and end-time predictions.
What carries the argument
The annotation-free query-to-mask grounding bridge (Q2M-Bridge), which parses a natural-language query into subject tags, produces corresponding instance masks, and embeds semantic markers plus frame indices directly into video frames.
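The tag-normalization half of the bridge can be illustrated with a minimal stdlib-only sketch. The determiner list, tiny lemma table, and head-noun heuristic below are hypothetical stand-ins for whatever parser and rules the paper actually uses; they are chosen only to make the idea concrete:

```python
import re

# Hypothetical stand-ins: the paper does not specify these tables or heuristics.
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}
LEMMAS = {"men": "man", "women": "woman", "children": "child", "people": "person"}

def normalize_subject(phrase: str) -> str:
    """Lowercase a noun phrase, drop determiners, and reduce the head noun
    (assumed to be the final token) to a rough singular lemma."""
    words = re.findall(r"[a-z]+", phrase.lower())
    words = [w for w in words if w not in DETERMINERS]
    if not words:
        return ""
    head = words[-1]
    # Crude plural stripping when the word is not in the lemma table.
    return LEMMAS.get(head, head[:-1] if head.endswith("s") else head)

def query_to_tags(subject_phrases):
    """Map candidate subject phrases to a compact, deduplicated tag set."""
    tags = []
    for p in subject_phrases:
        t = normalize_subject(p)
        if t and t not in tags:
            tags.append(t)
    return tags

print(query_to_tags(["Two men", "the men", "an indoor gym"]))
# prints ['man', 'gym']
```

The deduplicated tags would then serve as text prompts for the open-vocabulary segmenter.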
If this is right
- The marked video format produces state-of-the-art results on multiple moment retrieval and highlight detection benchmarks.
- The same marked input yields consistent temporal grounding gains across a wide range of existing Vid-LLMs.
- MarkIt functions as an inference-time plug-and-play addition that requires no changes to model weights.
- The approach remains fully compatible with subsequent supervised fine-tuning of the underlying Vid-LLM.
Where Pith is reading between the lines
- The marker approach could be tested on longer videos or multi-shot sequences to check whether persistent frame indices continue to help when entity tracking spans many minutes.
- Because the bridge relies on open-vocabulary segmentation, performance may vary with the quality of the underlying segmentation model and could be improved by swapping in stronger segmenters.
- The explicit visual cues might reduce the data needed to fine-tune new Vid-LLMs for temporal tasks, since the model receives direct location signals rather than having to learn them implicitly.
Load-bearing premise
The linguistic parsing and open-vocabulary segmentation steps correctly identify and mask only the query-relevant subjects without creating misleading visual noise that would confuse the Vid-LLM.
What would settle it
Apply MarkIt to a benchmark video where the generated masks highlight the wrong objects, then compare the Vid-LLM's temporal grounding accuracy against the unmarked baseline to see whether performance drops.
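Such a comparison reduces to standard VTG scoring. A minimal sketch of temporal IoU and an R@1-style recall metric, with made-up interval values purely for illustration:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresh=0.5):
    """Fraction of queries whose top prediction reaches the IoU threshold (R@1)."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(preds)

# Hypothetical predictions for the same queries, with and without marking.
gts      = [(10.0, 20.0), (5.0, 9.0)]
marked   = [(11.0, 19.0), (5.5, 9.5)]
unmarked = [(14.0, 30.0), (0.0, 4.0)]
print(recall_at_iou(marked, gts), recall_at_iou(unmarked, gts))
# prints 1.0 0.0
```

A drop in the marked-video score on wrongly masked inputs, relative to the unmarked baseline, would confirm the load-bearing premise matters.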
Original abstract
Video temporal grounding (VTG) aims to localize the start and end timestamps of the event described by a given query within an untrimmed video. Despite the strong open-world video understanding and recognition ability of video large language models (Vid-LLMs), outputting precise temporal grounding information remains challenging, since explicit temporal cues are scarce in untrimmed videos, and query-relevant entities are hard to track consistently across the video timeline. In this paper, we present MarkIt, a training-free framework that transforms an input video into a query-conditioned marked video, which empowers Vid-LLMs to generate more reliable temporal localization predictions. The core component of MarkIt is an annotation-free query-to-mask grounding bridge (Q2M-Bridge). Given a natural-language query, it automatically derives a compact set of canonical subject tags through linguistic parsing and normalization, then maps these tags to query-conditioned instance masks using text-conditioned open-vocabulary segmentation. The bridge also embeds lightweight semantic instance markers and a persistent frame index into each frame, effectively transforming long-range temporal reasoning into explicit visual cues for Vid-LLMs. MarkIt adopts an inference-time plug-and-play design, needs no modifications to Vid-LLM weights, and is fully compatible with supervised fine-tuning. Experiments conducted on multiple mainstream moment retrieval and highlight detection benchmarks demonstrate that MarkIt achieves state-of-the-art results, delivering consistent temporal grounding improvements across a wide range of existing models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MarkIt, a training-free plug-and-play framework that augments untrimmed videos with query-conditioned visual markers to improve temporal grounding in Vid-LLMs. The core is the annotation-free Q2M-Bridge, which parses natural-language queries into canonical subject tags via linguistic processing and maps them to instance masks using open-vocabulary segmentation; these masks are then augmented with lightweight semantic markers and persistent frame indices before being fed to the Vid-LLM. The manuscript claims this yields state-of-the-art results on mainstream moment retrieval and highlight detection benchmarks while remaining compatible with existing models and optional supervised fine-tuning.
Significance. If the Q2M-Bridge produces sufficiently accurate masks, the approach would supply a lightweight, inference-only method to inject explicit spatial-temporal cues into Vid-LLMs, addressing a known weakness in long-range temporal reasoning. The training-free design and broad compatibility constitute clear practical strengths; the reuse of off-the-shelf open-vocabulary segmentation components also avoids the need for new training data or model modifications.
major comments (2)
- [§4 (Experiments) and §3.2 (Q2M-Bridge)] The SOTA claims rest on the assumption that the generated instance masks are accurate enough to serve as reliable cues rather than noise, yet the manuscript supplies no quantitative mask-quality metrics (e.g., IoU against any reference annotations), no failure-case analysis for ambiguous queries or occlusions, and no ablation that isolates the bridge's contribution from the rest of the pipeline. This directly affects whether the reported gains are robust.
- [§3.2 (Q2M-Bridge)] The description of deriving 'canonical subject tags' through linguistic parsing and normalization lacks detail on the exact parser, normalization rules, or handling of multi-subject or negated queries; without such specification or error-rate measurements, it is impossible to assess how often the subsequent open-vocabulary segmentation receives a correct text prompt.
minor comments (2)
- The abstract states 'consistent temporal grounding improvements across a wide range of existing models' but the main text should explicitly list the Vid-LLMs tested and report per-model deltas for transparency.
- [Notation] Notation for the 'persistent frame index' and 'lightweight semantic instance markers' should be formalized (e.g., how the index is rendered and whether markers are text overlays or visual symbols) to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation and strengthen the empirical support for MarkIt. We address each major point below and commit to revisions that improve transparency without altering the core claims.
read point-by-point responses
-
Referee: [§4 (Experiments) and §3.2 (Q2M-Bridge)] The SOTA claims rest on the assumption that the generated instance masks are accurate enough to serve as reliable cues rather than noise, yet the manuscript supplies no quantitative mask-quality metrics (e.g., IoU against any reference annotations), no failure-case analysis for ambiguous queries or occlusions, and no ablation that isolates the bridge's contribution from the rest of the pipeline. This directly affects whether the reported gains are robust.
Authors: We agree that an explicit ablation isolating the Q2M-Bridge and qualitative failure-case analysis would strengthen the manuscript. The current results demonstrate consistent gains across multiple Vid-LLMs and benchmarks, which indirectly validates the utility of the generated markers; however, we will add (i) an ablation comparing performance with and without the Q2M-Bridge and (ii) a dedicated subsection with representative failure cases for ambiguous queries and occlusions. Regarding quantitative IoU, the moment-retrieval and highlight-detection benchmarks provide only temporal annotations, not per-frame instance masks aligned to the parsed subject tags; therefore we cannot compute reference IoU without introducing new annotations. We will instead report proxy statistics such as mask coverage over query-relevant regions and note this limitation explicitly. revision: partial
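The proposed mask-coverage proxy could be computed along these lines. This is a hedged sketch, not the authors' code: the `mask_frames` representation, the fps handling, and the span convention are all assumptions made for illustration:

```python
def mask_coverage(mask_frames, gt_span, fps=1.0):
    """Proxy statistic: fraction of frames inside the ground-truth temporal span
    that carry at least one instance mask. A stand-in for per-frame mask IoU,
    which the benchmarks' temporal-only annotations cannot support.

    mask_frames[i] is truthy if frame i received a mask for any subject tag;
    gt_span is a (start, end) pair in seconds."""
    start_f = int(gt_span[0] * fps)
    end_f = int(gt_span[1] * fps)
    in_span = mask_frames[start_f:end_f + 1]
    if not in_span:
        return 0.0
    return sum(1 for m in in_span if m) / len(in_span)

# Toy example: masks present only on frames 1-3 of a 6-frame clip.
frames = [False, True, True, True, False, False]
print(mask_coverage(frames, gt_span=(1.0, 3.0), fps=1.0))
# prints 1.0
```

Low coverage inside the ground-truth span would flag videos where the bridge likely failed, without requiring new mask annotations.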
-
Referee: [§3.2 (Q2M-Bridge)] The description of deriving 'canonical subject tags' through linguistic parsing and normalization lacks detail on the exact parser, normalization rules, or handling of multi-subject or negated queries; without such specification or error-rate measurements, it is impossible to assess how often the subsequent open-vocabulary segmentation receives a correct text prompt.
Authors: We will revise §3.2 to specify the exact pipeline: we employ spaCy for dependency parsing to extract noun phrases as candidate subjects, apply rule-based normalization (lowercasing, lemmatization, removal of determiners and modifiers), and handle multi-subject queries by emitting multiple tags while discarding negated phrases via dependency negation detection. We will also include a small-scale error analysis on 200 randomly sampled queries from the evaluation sets, reporting the percentage of tags that correctly match the intended referent. These additions will make the tag-derivation step fully reproducible and allow readers to gauge prompt quality for the downstream open-vocabulary segmenter. revision: yes
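The multi-subject and negation handling can be approximated with a stdlib-only stand-in for the dependency-based steps. The authors state they use spaCy's dependency parse; the split pattern and negation markers below are assumptions made purely to illustrate the behavior:

```python
import re

def extract_subject_phrases(query: str):
    """Rough stand-in for dependency-based noun-phrase extraction: split the
    query on commas and 'and' to emit one phrase per subject, and discard any
    phrase carrying a negation marker. Illustrative only; a real dependency
    parser would use the `neg` relation instead of string matching."""
    clauses = re.split(r",| and ", query.lower())
    keep = []
    for c in clauses:
        c = c.strip()
        if not c:
            continue
        # Hypothetical negation markers standing in for dependency negation.
        if any(m in f" {c} " for m in (" not ", " no ", " without ")):
            continue
        keep.append(c)
    return keep

print(extract_subject_phrases("a dog and a cat, but not a bird"))
# prints ['a dog', 'a cat']
```

The surviving phrases would then be normalized into canonical tags before segmentation.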
- Declined: quantitative IoU evaluation of the generated masks, because the standard VTG benchmarks supply only temporal ground truth and not query-aligned instance masks.
Circularity Check
No significant circularity; the derivation relies on external off-the-shelf components.
full rationale
The paper describes a training-free plug-and-play pipeline whose core Q2M-Bridge is assembled from pre-existing linguistic parsing/normalization and text-conditioned open-vocabulary segmentation tools. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the provided text; the claimed temporal-grounding gains are presented as empirical outcomes on external benchmarks rather than reductions that hold by construction from the method's own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: linguistic parsing and normalization reliably extract a compact set of canonical subject tags from arbitrary natural-language queries.
- Domain assumption: text-conditioned open-vocabulary segmentation can map the extracted tags to accurate query-conditioned instance masks across video frames.
invented entities (2)
- Lightweight semantic instance markers (no independent evidence)
- Persistent frame index (no independent evidence)