SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Aljoša Ošep; Jindong Gu; Laura Leal-Taixé; Mark Weber; Rajat Koner; Shuaicong Wu; Suprosanna Shit; Tanveer Hannan; Thomas Seidl

arxiv: 2510.13016 · v3 · pith:TTZNEXN6new · submitted 2025-10-14 · 💻 cs.CV

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl

Pith reviewed 2026-05-18 07:02 UTC · model grok-4.3

keywords: video grounding, spatio-temporal localization, action recognition, multi-instance, benchmark, natural language, object tracking

The pith

SVAG-Bench introduces a task requiring models to detect, track, and temporally localize every object matching a language query in multi-actor video scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that progress in embodied AI requires benchmarks that test the combined skills of spatial detection, temporal localization, and tracking of multiple actors from natural language descriptions. Existing video benchmarks examine these capabilities separately and therefore cannot assess their integration in realistic scenes. The authors respond by defining the Spatio-temporal Video Action Grounding task and releasing SVAG-Bench, a dataset of 688 videos containing 19,590 annotations and 903 action verbs drawn from urban, wildlife, and traffic footage. With an average of 28.5 queries per video, the benchmark supports detailed measurement of how models handle actor disambiguation and action composition. The work also supplies a standardized evaluation toolkit and a modular baseline model.
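The density figure can be checked against the other reported counts. The sketch below is a minimal arithmetic check, assuming (this is not stated in the source) that each of the 19,590 annotations corresponds to one query; the variable names are illustrative.

```python
# Reported dataset counts from the summary above.
videos = 688
annotations = 19_590

# Assumption: one annotation per query, so density = annotations / videos.
queries_per_video = annotations / videos

# Rounded to one decimal place, this matches the reported average.
print(round(queries_per_video, 1))
```

Under that assumption the ratio rounds to the reported 28.5 queries per video.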

Core claim

The paper presents Spatio-temporal Video Action Grounding (SVAG) as the task of simultaneously detecting, tracking, and temporally localizing all objects that satisfy a natural language query in complex, multi-actor scenes, supported by the SVAG-Bench dataset containing 688 videos, 19,590 annotations, and 903 unique action verbs.

What carries the argument

The SVAG task, which unifies spatial detection, object tracking, and temporal localization for multiple instances based on language queries in crowded scenes.
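To make the multi-instance requirement concrete, here is a minimal sketch of what one query's answer might look like as a data structure. The class names, field names, and box convention are hypothetical and are not taken from SVAG-Bench or SVAGEval; the query text and the first frame range reuse the example from the Figure 3 caption, while the second track is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedTrack:
    """One actor that satisfies the query: when it acts and where it is."""
    track_id: int
    start_frame: int                 # first frame of the action (inclusive)
    end_frame: int                   # last frame of the action (inclusive)
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame -> box


@dataclass
class QueryGrounding:
    """All actors matching one language query in one video."""
    video_id: str
    query: str
    tracks: List[GroundedTrack] = field(default_factory=list)


example = QueryGrounding(
    video_id="demo_video",  # hypothetical identifier
    query="A person is dancing in the open area",
    tracks=[
        GroundedTrack(1, 2543, 2782, {2543: (10.0, 20.0, 60.0, 180.0)}),
        # Invented second actor: a multi-instance answer is a set of tracks.
        GroundedTrack(7, 2600, 2700, {2600: (200.0, 25.0, 250.0, 190.0)}),
    ],
)
print(len(example.tracks))
```

The point of the structure is that a single query maps to a list of tracks, each with its own temporal extent, rather than to one box sequence.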

Load-bearing premise

The annotation pipeline of expert manual labeling combined with automated paraphrase augmentation and human verification produces linguistically diverse and factually correct labels that accurately capture multi-actor disambiguation without systematic bias or error.

What would settle it

An independent review of a sample of annotations that identifies consistent mismatches between the language queries and the actual actors or timings present in the videos.
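Such an audit could be sketched as follows: draw a reproducible random sample of annotation indices, have reviewers count mismatches, and report the mismatch rate with a Wilson score interval. The sample size of 200, the count of 5 mismatches, and the function name are illustrative assumptions, not figures from the paper.

```python
import math
import random


def wilson_interval(mismatches: int, sample_size: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = mismatches / sample_size
    denom = 1 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Reproducible sample of annotation indices out of the 19,590 reported.
rng = random.Random(0)
audit_ids = rng.sample(range(19_590), k=200)

# Hypothetical audit outcome: 5 of the 200 sampled annotations mismatch.
low, high = wilson_interval(mismatches=5, sample_size=len(audit_ids))
print(f"mismatch rate between {low:.3f} and {high:.3f}")
```

A narrow interval near zero would support the premise; an interval well above zero would indicate the systematic error the premise rules out.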

Figures

Figures reproduced from arXiv: 2510.13016 by Aljoša Ošep, Jindong Gu, Laura Leal-Taixé, Mark Weber, Rajat Koner, Shuaicong Wu, Suprosanna Shit, Tanveer Hannan, Thomas Seidl.

**Figure 2.** Statistics of SVAG-Bench. The majority of queries fall within the range of 6 to 10 words.

The dataset sources videos from established multi-object tracking benchmarks (MOT17 [4], MOT20 [5], and OVIS [22]), selecting sequences where objects of similar appearance perform different actions.

**Figure 3.** Overview of the SVAGFormer pipeline for Spatio-temporal Video Action Grounding (SVAG). Given a natural language query (e.g., “A person is dancing in the open area”), the model first performs temporal grounding to localize the relevant video segment (Start: 2543 → End: 2782), followed by spatial grounding to identify the target person across frames.
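The two-stage order described in the Figure 3 caption (temporal grounding first, then spatial grounding inside the localized segment) can be sketched as a composition of two callables. This is only an illustration of the control flow: the function names, signatures, and stub outputs are assumptions, and nothing here reflects SVAGFormer's actual modules.

```python
from typing import Callable, Dict, Tuple

Box = Tuple[float, float, float, float]   # hypothetical (x1, y1, x2, y2)
Segment = Tuple[int, int]                 # (start_frame, end_frame), inclusive
Tracks = Dict[int, Dict[int, Box]]        # track_id -> {frame -> box}


def ground_query(
    query: str,
    num_frames: int,
    temporal_fn: Callable[[str, int], Segment],
    spatial_fn: Callable[[str, int, int], Tracks],
) -> Tuple[Segment, Tracks]:
    """Run temporal grounding, then spatial grounding on that segment."""
    start, end = temporal_fn(query, num_frames)
    tracks = spatial_fn(query, start, end)
    # Keep only boxes that fall inside the temporally grounded segment.
    clipped = {
        tid: {f: b for f, b in boxes.items() if start <= f <= end}
        for tid, boxes in tracks.items()
    }
    return (start, end), clipped


# Stub stages standing in for real models (outputs are made up).
def stub_temporal(query: str, num_frames: int) -> Segment:
    return (2543, 2782)


def stub_spatial(query: str, start: int, end: int) -> Tracks:
    return {1: {2500: (0.0, 0.0, 10.0, 10.0),
                2543: (1.0, 1.0, 11.0, 11.0),
                2782: (2.0, 2.0, 12.0, 12.0)}}


segment, tracks = ground_query(
    "A person is dancing in the open area", 3000, stub_temporal, stub_spatial
)
print(segment, sorted(tracks[1]))
```

Passing the stages in as callables keeps the sketch modular, so either stage can be swapped without touching the other.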

**Figure 4.** Flowchart for processing evaluation. Spatial and temporal evaluations are conducted separately on …
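Because the caption says spatial and temporal evaluations are conducted separately, it may help to see the two generic overlap measures such evaluations are commonly built on. The sketch below shows temporal IoU over inclusive frame intervals and box IoU over axis-aligned boxes. These are standard textbook definitions offered as an assumption; this page does not document the metrics SVAGEval actually computes.

```python
from typing import Tuple

Segment = Tuple[int, int]                 # inclusive (start_frame, end_frame)
Box = Tuple[float, float, float, float]   # hypothetical (x1, y1, x2, y2)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two inclusive frame intervals."""
    inter = max(min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1, 0)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(temporal_iou((0, 9), (5, 14)))
print(box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

Scoring the two axes separately makes it possible to tell whether a model fails on when an action happens or on which actor performs it.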

**Figure 5.** A qualitative visualization example of a zebra performing a fine-grained action: tilting its head to the …

**Figure 6.** Statistics on OVIS dataset. High performance on categories like Dog, Rabbit, and Airplane. Dynamic actions like move, fight lead successful sequences.

The competition is hosted on the Codabench platform, with SVAGEval as the official evaluation benchmark. Several teams have submitted their results, and a summary leaderboard is presented in Tab. 4.

Original abstract

A truly capable AI system must do more than detect objects or recognize activities in isolation. It must form unified, grounded representations of who is acting, what they are doing, and when and where these actions unfold. These representations provide the perceptual bedrock for high-level reasoning, planning, and embodied interaction in the real world. Building such agents is central to long-horizon goals in embodied AI and robotics. Current video benchmarks evaluate fragments of these capabilities in isolation. They focus on either spatial grounding, object tracking, or temporal localization. As a result, they cannot rigorously measure progress on their joint, multi-instance integration. We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes. To support this task, we construct SVAG-Bench. It comprises 688 videos, 19,590 verified annotations, and 903 unique action verbs drawn from crowded urban environments, wildlife, and traffic surveillance. Each video has on average 28.5 action-centric queries. This yields the densest annotation among comparable video grounding benchmarks and enables fine-grained evaluation of multi-actor disambiguation, temporal overlap, and action compositionality. Annotations are produced by a pipeline that combines expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification to ensure both linguistic diversity and correctness. We further release SVAGEval, a standardized multi-referent evaluation toolkit. We also introduce SVAGFormer, a strong modular baseline architecture for SVAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the Spatio-temporal Video Action Grounding (SVAG) task, which requires models to jointly detect, track, and temporally localize all instances satisfying a natural language query in multi-actor scenes. It presents SVAG-Bench (688 videos, 19,590 verified annotations, 903 verbs from urban/wildlife/traffic domains, average 28.5 queries per video) as the densest such benchmark, produced via expert labeling + GPT-3.5 paraphrasing + human verification, plus the SVAGEval toolkit and SVAGFormer baseline.

Significance. If the annotations prove reliable, the benchmark would fill a clear gap by enabling rigorous evaluation of integrated spatio-temporal grounding rather than isolated subtasks, supporting progress toward embodied AI. The annotation density, multi-referent focus, and release of a standardized evaluation toolkit are concrete strengths that could facilitate reproducible comparisons.

major comments (1)

[Abstract] The central claim that the pipeline of expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification yields linguistically diverse and factually correct labels for multi-actor disambiguation rests on an unquantified assertion. No inter-annotator agreement scores, verification error rates, or failure-mode analysis for temporal overlap or instance disambiguation are reported, directly affecting the reliability of the 'densest annotation' and evaluation claims.

minor comments (1)

Provide explicit details on video selection criteria, exact train/val/test splits, and how the 903 verbs were curated to allow independent reproduction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of SVAG-Bench in advancing integrated spatio-temporal grounding research. We address the concern regarding annotation reliability below and commit to revisions that directly strengthen the supporting evidence.

Point-by-point responses

Referee: [Abstract] The central claim that the pipeline of expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification yields linguistically diverse and factually correct labels for multi-actor disambiguation rests on an unquantified assertion. No inter-annotator agreement scores, verification error rates, or failure-mode analysis for temporal overlap or instance disambiguation are reported, directly affecting the reliability of the 'densest annotation' and evaluation claims.

Authors: We agree that the abstract's assertion would be more robust with explicit quantitative backing. Section 3.2 of the manuscript describes the three-stage pipeline (expert manual labeling by trained annotators, GPT-3.5 paraphrase augmentation for linguistic diversity, and independent human verification for factual accuracy), yet no numerical metrics are supplied. We will revise the paper to add: (1) inter-annotator agreement scores computed via Fleiss' kappa on a 10% random subset of videos, covering verb selection, temporal interval boundaries, and instance ID assignment; (2) verification error rates, defined as the fraction of GPT-generated paraphrases rejected or corrected during the final human review; and (3) a concise failure-mode analysis in the annotation section that discusses observed challenges with concurrent actions (temporal overlap) and crowded-scene disambiguation, supported by representative examples from the urban and wildlife subsets. These additions will appear in a dedicated subsection on annotation quality and will be referenced in the abstract and introduction.

revision: yes
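For readers unfamiliar with the agreement statistic named in the rebuttal, the sketch below computes Fleiss' kappa from a table of rating counts using the standard formula. It is not the authors' implementation, and the toy rating tables are invented for illustration; the function name and input layout are assumptions.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = raters who put item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings share a category")
    return (p_bar - p_exp) / (1 - p_exp)


# Invented toy data: 4 items, 3 raters, 2 categories, full agreement.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(fleiss_kappa(perfect))
```

A value of 1 indicates complete agreement, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.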

Circularity Check

0 steps flagged

No circularity in task definition or benchmark construction

The paper defines a new task (SVAG) requiring simultaneous detection, tracking, and temporal localization of all objects matching a natural language query, then constructs SVAG-Bench via expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification. No equations, parameter fitting, or predictions appear that reduce to inputs by construction. The contribution is self-contained as an independent task and dataset release (688 videos, 19,590 annotations, 903 verbs) without load-bearing self-citations, uniqueness theorems from prior author work, or renaming of known results. Annotation pipeline details address quality but do not create circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard computer vision assumptions about the validity of human-verified annotations and the utility of the proposed task for downstream embodied AI. It introduces no free parameters or invented entities; the single axiom listed below is a domain assumption.

axioms (1)

domain assumption: Unified grounded representations of actor, action, time, and location are required for high-level reasoning in embodied AI.
Stated directly in the opening of the abstract as the perceptual bedrock.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
cs.CV · 2026-05 · unverdicted · novelty 8.0

VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Zhuo Cao, Bingqing Zhang, Heming Du, Xin Yu, Xue Li, and Sen Wang. Flashvtg: Feature layering and adaptive score handling network for video temporal grounding. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 9208–9218, February

[2] Yi-Wen Chen, Yi-Hsuan Tsai, and Ming-Hsuan Yang. End-to-end multi-modal video temporal grounding. Advances in Neural Information Processing Systems, 34:28442–28453, 2021.

[3] Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, and Deva Ramanan. Tao: A large-scale benchmark for tracking any object. In European conference on computer vision, pages 436–454. Springer, 2020.

[4] Patrick Dendorfer, Aljosa Osep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, and Laura Leal-Taixé. Motchallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision, 129:845–881, 2021.

[5] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003, 2020.

[6] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5374–5383, 2019.

[7] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017.

[8] Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, and Libo Zhang. Context-guided spatio-temporal video grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18330–18339, 2024.

[9] Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiao-Yong Wei, Chang Wen Chen, and Qing Li. Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7249–7258, 2024.

[10] Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video object segmentation with language referring expressions. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part IV 14, pages 123–141. Springer, 2019.

[11] Sangyoun Lee, Juho Jung, Changdae Oh, and Sunghee Yun. Enhancing temporal action localization: Advanced s6 modeling with recurrent mechanism. arXiv preprint arXiv:2407.13078,

[12] Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858,

[13] Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. arXiv preprint arXiv:1904.11574, 2019.

[14] Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023.

[15] Zihang Lin, Chaolei Tan, Jian-Fang Hu, Zhi Jin, Tiancai Ye, and Wei-Shi Zheng. Stvgformer: Spatio-temporal video grounding with static-dynamic cross-modal understanding. In Proceedings of the 4th on Person in Context Workshop, pages 1–5, 2022.

[16] Shuming Liu, Chen-Lin Zhang, Chen Zhao, and Bernard Ghanem. End-to-end temporal action detection with 1b parameters across 1000 frames. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18591–18601, 2024.

[17] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking. International journal of computer vision, 129:548–578, 2021.

[18] Shu Luo, Jingyu Pan, Da Cao, Jiawei Wang, Yuquan Le, and Meng Liu. Spatial–temporal video grounding with cross-modal understanding and enhancement. Expert Systems with Applications, 271:126650, 2025.

[19] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 792–807. Springer, 2016.

[20] Pha Nguyen, Kha Gia Quach, Kris Kitani, and Khoa Luu. Type-to-track: Retrieve any object via prompt-based tracking. Advances in Neural Information Processing Systems, 36:3205–3219,

[21] OpenAI. https://chatgpt.com/, 2023.

[22] Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip HS Torr, and Song Bai. Occluded video instance segmentation: A benchmark. International Journal of Computer Vision, 130(8):2022–2039, 2022.

[23] Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. Chatvtg: Video temporal grounding via chat with video dialogue large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024.

[24] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36, 2013.

[25] Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, pages 208–223. Springer, 2020.

[26] Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pages 279–287, 2019.

[27] Dingfeng Shi, Qiong Cao, Yujie Zhong, Shan An, Jian Cheng, Haogang Zhu, and Dacheng Tao. Temporal action localization with enhanced instant discriminability. arXiv preprint arXiv:2309.05590, 2023.

[28] Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8238–8249, 2021.

[29] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[30] Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, and Yongdong Zhang. Spacevllm: Endowing multimodal large language model with spatio-temporal video grounding capability. arXiv preprint arXiv:2503.13983, 2025.

[31] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer,

[32] Dongming Wu, Wencheng Han, Tiancai Wang, Xingping Dong, Xiangyu Zhang, and Jianbing Shen. Referring multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14633–14642, 2023.

[33] Junbin Xiao, Xindi Shang, Xun Yang, Sheng Tang, and Tat-Seng Chua. Visual relation grounding in videos. In European conference on computer vision, pages 447–464. Springer,

[34] Zhe Xu, Kun Wei, Xu Yang, and Cheng Deng. Point-supervised video temporal grounding. IEEE Transactions on Multimedia, 25:6121–6131, 2022.

[35] Yani Zhang, Dongming Wu, Wencheng Han, and Xingping Dong. Bootstrapping referring multi-object tracking. arXiv preprint arXiv:2406.05039, 2024.

[36] Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10668–10677, 2020.

[1] [1]

Flashvtg: Feature layering and adaptive score handling network for video temporal grounding

Zhuo Cao, Bingqing Zhang, Heming Du, Xin Yu, Xue Li, and Sen Wang. Flashvtg: Feature layering and adaptive score handling network for video temporal grounding. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 9208–9218, February

work page

[2] [2]

End-to-end multi-modal video temporal grounding.Advances in Neural Information Processing Systems, 34:28442–28453, 2021

Yi-Wen Chen, Yi-Hsuan Tsai, and Ming-Hsuan Yang. End-to-end multi-modal video temporal grounding.Advances in Neural Information Processing Systems, 34:28442–28453, 2021. 4

work page 2021

[3] [3]

Tao: A large-scale benchmark for tracking any object

Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, and Deva Ramanan. Tao: A large-scale benchmark for tracking any object. InEuropean conference on computer vision, pages 436–454. Springer, 2020. 4

work page 2020

[4] [4]

Motchallenge: A benchmark for single-camera multiple target tracking.International Journal of Computer Vision, 129:845–881, 2021

Patrick Dendorfer, Aljosa Osep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, and Laura Leal-Taixé. Motchallenge: A benchmark for single-camera multiple target tracking.International Journal of Computer Vision, 129:845–881, 2021. 5, 8

work page 2021

[5] [5]

arXiv preprint arXiv:2003.09003

Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. Mot20: A benchmark for multi object tracking in crowded scenes.arXiv preprint arXiv:2003.09003, 2020. 5, 8

work page arXiv 2003

[6] [6]

Lasot: A high-quality benchmark for large-scale single object tracking

Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5374–5383, 2019. 2, 4

work page 2019

[7] [7]

Tall: Temporal activity localization via language query

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017. 2, 4

work page 2017

[8] [8]

Context-guided spatio-temporal video grounding

Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, and Libo Zhang. Context-guided spatio-temporal video grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18330–18339, 2024. 1

work page 2024

[9] [9]

Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval

Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiao-Yong Wei, Chang Wen Chen, and Qing Li. Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7249–7258, 2024. 7

work page 2024

[10] [10]

Video object segmentation with language referring expressions

Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video object segmentation with language referring expressions. InComputer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part IV 14, pages 123–141. Springer, 2019. 1

work page 2018

[11] [11]

Enhancing temporal action local- ization: Advanced s6 modeling with recurrent mechanism.arXiv preprint arXiv:2407.13078,

Sangyoun Lee, Juho Jung, Changdae Oh, and Sunghee Yun. Enhancing temporal action local- ization: Advanced s6 modeling with recurrent mechanism.arXiv preprint arXiv:2407.13078,

work page arXiv

[12] [12]

Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems, 34:11846–11858,

Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems, 34:11846–11858,

work page

[13] [13]

Tvqa+: Spatio-temporal grounding for video question answering.arXiv preprint arXiv:1904.11574, 2019

Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering.arXiv preprint arXiv:1904.11574, 2019. 4

work page arXiv 1904

[14] [14]

Univtg: Towards unified video-language temporal grounding

Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jin- peng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023. 4

work page 2023

[15] [15]

Stvgformer: Spatio-temporal video grounding with static-dynamic cross-modal understanding

Zihang Lin, Chaolei Tan, Jian-Fang Hu, Zhi Jin, Tiancai Ye, and Wei-Shi Zheng. Stvgformer: Spatio-temporal video grounding with static-dynamic cross-modal understanding. InProceed- ings of the 4th on Person in Context Workshop, pages 1–5, 2022. 4 11

work page 2022

[16] [16]

End-to-end temporal action detection with 1b parameters across 1000 frames

Shuming Liu, Chen-Lin Zhang, Chen Zhao, and Bernard Ghanem. End-to-end temporal action detection with 1b parameters across 1000 frames. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18591–18601, 2024. 1

work page 2024

[17] [17]

Hota: A higher order metric for evaluating multi-object tracking

Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal- Taixé, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking. International journal of computer vision, 129:548–578, 2021. 7

work page 2021

[18] Shu Luo, Jingyu Pan, Da Cao, Jiawei Wang, Yuquan Le, and Meng Liu. Spatial–temporal video grounding with cross-modal understanding and enhancement. Expert Systems with Applications, 271:126650, 2025.

[19] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 792–807. Springer, 2016.

[20] Pha Nguyen, Kha Gia Quach, Kris Kitani, and Khoa Luu. Type-to-track: Retrieve any object via prompt-based tracking. Advances in Neural Information Processing Systems, 36:3205–3219,

[21] OpenAI. https://chatgpt.com/, 2023.

[22] Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip HS Torr, and Song Bai. Occluded video instance segmentation: A benchmark. International Journal of Computer Vision, 130(8):2022–2039, 2022.

[23] Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. ChatVTG: Video temporal grounding via chat with video dialogue large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024.

[24] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36, 2013.

[25] Seonguk Seo, Joon-Young Lee, and Bohyung Han. URVOS: Unified referring video object segmentation network with a large-scale benchmark. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, pages 208–223. Springer, 2020.

[26] Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pages 279–287, 2019.

[27] Dingfeng Shi, Qiong Cao, Yujie Zhong, Shan An, Jian Cheng, Haogang Zhu, and Dacheng Tao. Temporal action localization with enhanced instant discriminability. arXiv preprint arXiv:2309.05590, 2023.

[28] Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8238–8249, 2021.

[29] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[30] Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, and Yongdong Zhang. SpaceVLLM: Endowing multimodal large language model with spatio-temporal video grounding capability. arXiv preprint arXiv:2503.13983, 2025.

[31] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. InternVideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer,

[32] Dongming Wu, Wencheng Han, Tiancai Wang, Xingping Dong, Xiangyu Zhang, and Jianbing Shen. Referring multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14633–14642, 2023.

[33] Junbin Xiao, Xindi Shang, Xun Yang, Sheng Tang, and Tat-Seng Chua. Visual relation grounding in videos. In European Conference on Computer Vision, pages 447–464. Springer,

[34] Zhe Xu, Kun Wei, Xu Yang, and Cheng Deng. Point-supervised video temporal grounding. IEEE Transactions on Multimedia, 25:6121–6131, 2022.

[35] Yani Zhang, Dongming Wu, Wencheng Han, and Xingping Dong. Bootstrapping referring multi-object tracking. arXiv preprint arXiv:2406.05039, 2024.

[36] Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10668–10677, 2020.