Efficient Spatio-Temporal Grounding with Multimodal Large Models via Second-Level Tracking and RL Verification

Ji Qi; Lijie Wen; Tianshu Zhang; Yan Wang

arxiv: 2606.29023 · v1 · pith:H5NVLYCFnew · submitted 2026-06-27 · 💻 cs.CV · cs.AI


Pith reviewed 2026-06-30 09:13 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: spatio-temporal grounding · multimodal large models · second-level tracking · reinforcement learning · video localization · chain-of-thought trajectories · temporal smoothing · efficiency trade-off


The pith

Shifting to second-level tracking with RL verification enables efficient spatio-temporal grounding in long videos while preserving localization quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to make spatio-temporal grounding practical for extended video sequences by replacing costly frame-by-frame VLM inference with second-level tracking plus cross-second smoothing. It creates supervision by generating chain-of-thought trajectories from multimodal models and then swapping the predicted coordinates for ground-truth annotations to reduce noise. Reinforcement learning refines the policy using a verifier that scores both temporal overlap and movement consistency. Experiments demonstrate that the resulting pipeline maintains competitive accuracy at multiple frame rates while lowering overall computation. A reader would care if this approach scales grounding to longer, real-world videos without requiring constant high-FPS processing.

Core claim

The paper establishes that a pipeline moving from frame-level to second-level tracking with smoothing, combined with synthesized trajectories refined by ground-truth coordinate replacement and optimized through RL against a t_IoU + mv_IoU verifier, delivers a practical method for accurate and efficient language-conditioned spatio-temporal localization in videos.

What carries the argument

Second-level tracking with cross-second smoothing, which shortens input sequences while preserving temporal continuity, paired with RL policy optimization driven by the combined temporal and movement IoU verifier.
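As a rough illustration of why second-level tracking shortens the input, the sketch below samples one frame per second and then smooths the per-second boxes. The abstract does not say which smoother the authors use, so the exponential moving average here is purely an assumption, as are the function names `second_level_indices` and `smooth_boxes` and the `alpha` parameter.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def second_level_indices(num_frames: int, fps: float) -> List[int]:
    """Return one frame index per whole second (the frame nearest each second mark)."""
    if num_frames <= 0 or fps <= 0:
        return []
    indices: List[int] = []
    second = 0
    while True:
        idx = int(round(second * fps))
        if idx >= num_frames:
            break
        indices.append(idx)
        second += 1
    return indices


def smooth_boxes(boxes: List[Box], alpha: float = 0.6) -> List[Box]:
    """Exponential moving average across seconds (assumed smoother, not the paper's)."""
    smoothed: List[Box] = []
    for box in boxes:
        if not smoothed:
            # The first second has nothing to blend with.
            smoothed.append(box)
            continue
        prev = smoothed[-1]
        # Blend the current box with the previous smoothed box, coordinate by coordinate.
        smoothed.append(tuple(alpha * b + (1 - alpha) * p for b, p in zip(box, prev)))
    return smoothed
```

Under these assumptions a 10-second clip at 30 FPS contributes 10 frames to the model instead of 300, and the smoother damps box jitter between consecutive seconds.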

If this is right

Sequence length reduction lowers the computational cost of applying multimodal models to long videos.
Cross-second smoothing improves continuity and stability of tracked objects across time.
Ground-truth replacement during training yields cleaner supervision signals for the policy.
The RL verifier combining t_IoU and mv_IoU directly optimizes for both temporal and spatial accuracy.
Performance holds across varying FPS settings, indicating robustness to different video sampling rates.
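The verifier can be pictured with a small reward function. The abstract gives only the form t_IoU + mv_IoU, so this is a sketch under stated assumptions: t_IoU is taken to be the IoU of the predicted and annotated time intervals, and mv_IoU is assumed to follow the vIoU convention common in video grounding work (box IoU summed over seconds present in both tracks, divided by seconds present in either). All function names are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Interval = Tuple[float, float]           # (start_second, end_second)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """IoU of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def video_iou(pred_boxes: Dict[int, Box], gt_boxes: Dict[int, Box]) -> float:
    """Assumed mv_IoU: box IoU over shared seconds, normalised by the union of seconds."""
    union_seconds = set(pred_boxes) | set(gt_boxes)
    if not union_seconds:
        return 0.0
    shared = set(pred_boxes) & set(gt_boxes)
    total = sum(box_iou(pred_boxes[s], gt_boxes[s]) for s in shared)
    return total / len(union_seconds)


def verifier_reward(pred_iv: Interval, gt_iv: Interval,
                    pred_boxes: Dict[int, Box], gt_boxes: Dict[int, Box]) -> float:
    """Unweighted sum, matching the form written in the abstract."""
    return temporal_iou(pred_iv, gt_iv) + video_iou(pred_boxes, gt_boxes)
```

With this normalisation, a prediction that covers the wrong seconds is penalised twice: once through the interval overlap and once because unmatched seconds enlarge the denominator of the box term.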

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same second-level reduction could be tested on other video reasoning tasks that currently rely on dense frame sampling.
Hybrid supervision mixing model-generated trajectories with ground-truth swaps may generalize to additional multimodal video benchmarks.
Further efficiency gains might appear if the RL stage is applied to lighter base models rather than the largest VLMs.
The verifier metric itself could serve as a diagnostic for where temporal versus spatial errors dominate in grounding failures.

Load-bearing premise

Substituting generated spatio-temporal coordinates with ground-truth annotations during training produces reliable supervision without creating a harmful distribution shift at inference time.
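To make the premise concrete, here is a minimal sketch of what coordinate substitution might look like. The trace format is invented for illustration: the `<span>` and `<box t=...>` tags, and the function name `substitute_ground_truth`, are hypothetical and do not come from the paper. The reasoning text produced by the teacher model is kept, while every time span and box is overwritten with the annotation.

```python
import re
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

# Hypothetical markup for a generated chain-of-thought trace.
SPAN_RE = re.compile(r"<span>.*?</span>")
BOX_RE = re.compile(r"<box t=(\d+)>.*?</box>")


def substitute_ground_truth(trace: str, gt_span: Tuple[int, int],
                            gt_boxes: Dict[int, Box]) -> str:
    """Keep the reasoning text; replace predicted coordinates with annotations."""
    # Overwrite every predicted time span with the annotated one.
    trace = SPAN_RE.sub(f"<span>{gt_span[0]}-{gt_span[1]}</span>", trace)

    def replace_box(match):
        second = int(match.group(1))
        if second not in gt_boxes:
            # No annotation for this second: drop the box rather than keep a noisy one.
            return ""
        x1, y1, x2, y2 = gt_boxes[second]
        return f"<box t={second}>[{x1},{y1},{x2},{y2}]</box>"

    return BOX_RE.sub(replace_box, trace)
```

The sketch also shows where the distribution-shift worry comes from: the supervised target pairs the teacher's reasoning with coordinates the teacher never produced, so the text and the numbers are only guaranteed to agree when the teacher's original prediction was already close.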

What would settle it

A controlled test in which the full pipeline is run at inference without any ground-truth coordinate access and localization quality falls below strong frame-by-frame baselines at matched compute budgets.

Figures

Figures reproduced from arXiv: 2606.29023 by Ji Qi, Lijie Wen, Tianshu Zhang, Yan Wang.

**Figure 2.** Model architecture and three-stage training. We train a GLM-V-style multimodal architec…

**Figure 3.** Compact auxiliary results and ablations. Left: Video-MME-v2 head-to-head comparison.

**Figure 4.** Qualitative case studies. Left: the model localizes the queried theft-related event and tracks…

Abstract

Spatio-temporal grounding in long videos requires precise temporal localization and robust object tracking conditioned on natural-language queries. While recent vision-language models (VLMs) show strong reasoning ability, directly applying frame-by-frame inference to long sequences is computationally expensive and unstable. We propose a practical pipeline that shifts from frame-level to second-level tracking and performs cross-second smoothing to preserve continuity while reducing sequence length. To improve reasoning supervision, we synthesize chain-of-thought style trajectories using advanced multimodal models for temporal localization and target selection, and replace generated spatio-temporal coordinates with ground-truth annotations to avoid noisy supervision. We further optimize the policy with reinforcement learning using a verifier based on $t\_\mathrm{IoU}+mv\_\mathrm{IoU}$. Experiments across multiple FPS settings show that our method achieves a strong trade-off between efficiency and localization quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper describes a practical pipeline for cutting compute in video grounding via second-level tracking and RL, but the ground-truth substitution step leaves a clear train-inference gap and the abstract supplies no numbers to back the claimed trade-off.


The main point is an engineering pipeline that reduces long-video grounding cost by shifting to second-level tracking plus smoothing, synthesizing CoT trajectories, swapping in ground-truth coordinates for supervision, and then running RL against a t_IoU + mv_IoU verifier. Experiments are said to show a solid efficiency-quality balance across FPS settings.

It does a straightforward job of naming the expense of frame-by-frame VLM runs on long sequences and offers a concrete reduction in sequence length. The RL step on a task-specific verifier is a direct way to tune the policy once the trajectories are available.

The soft spot is the substitution step itself. Perfect ground-truth labels are fed to the RL stage during training, yet at inference the model must use its own noisy outputs with no described mechanism to close the gap. That mismatch is load-bearing for the reported trade-off. The abstract also gives no quantitative results, no ablation numbers, and no error breakdown, so the central claim stays unverified from the text. The verifier weighting is likewise unspecified.

The work assembles existing pieces (tracking, CoT synthesis, IoU-based RL) without a new mathematical result or previously unseen phenomenon. Citation patterns are not visible here.

This is for practitioners who already work on video-language grounding and want efficiency tweaks rather than new theory. A reader running similar systems might pick up implementation ideas if the full experiments hold, but the evidence presented is thin.

I would bring it to a reading group to talk through the distribution-shift issue. I would not cite it in my own work. It deserves peer review so the actual numbers and any fixes for the train-inference gap can be checked.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes a pipeline for spatio-temporal grounding in long videos that uses second-level tracking with cross-second smoothing to reduce sequence length, synthesizes chain-of-thought trajectories via multimodal models for temporal localization and target selection, substitutes generated spatio-temporal coordinates with ground-truth annotations during training to avoid noisy supervision, and optimizes the policy via reinforcement learning with a verifier based on t_IoU + mv_IoU. Experiments across multiple FPS settings are claimed to demonstrate a strong efficiency-localization quality trade-off.

Significance. If the central efficiency-quality trade-off holds after addressing supervision issues, the work could offer a practical advance for scaling vision-language models to long videos by reducing frame-by-frame computation while preserving localization accuracy. The combination of trajectory synthesis and RL verification is a reasonable direction for improving supervision quality in multimodal grounding tasks.

major comments (3)

[Abstract] Abstract (paragraph on synthesizing trajectories): The ground-truth substitution step supplies perfect labels to the RL policy (verifier t_IoU + mv_IoU) during training, but at inference the model must operate on its own noisy predictions. No mechanism is described to close the resulting train-inference distribution shift, which directly risks the validity of the reported trade-off across FPS settings.
[Abstract] Abstract: The weighting between t_IoU and mv_IoU in the verifier is not specified as external to the optimization; this makes the RL stage circular with respect to the supervision quality and prevents independent assessment of whether the verifier provides reliable gradients.
[Abstract] Abstract: No quantitative results, ablation studies, error analysis, or specific metrics (e.g., t_IoU values, FPS comparisons, or baseline numbers) are supplied to support the claim of a 'strong trade-off,' rendering the central empirical claim unverifiable from the manuscript text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful comments on our manuscript. We address each major point below, providing clarifications on the pipeline design and committing to revisions where they strengthen the presentation of the efficiency-quality trade-off.


Referee: [Abstract] Abstract (paragraph on synthesizing trajectories): The ground-truth substitution step supplies perfect labels to the RL policy (verifier t_IoU + mv_IoU) during training, but at inference the model must operate on its own noisy predictions. No mechanism is described to close the resulting train-inference distribution shift, which directly risks the validity of the reported trade-off across FPS settings.

Authors: The ground-truth substitution is applied only during the CoT trajectory synthesis stage to create clean supervision signals for initial policy training, avoiding propagation of noisy multimodal predictions. The subsequent RL stage then optimizes the policy directly against the t_IoU + mv_IoU verifier on the model's own outputs, training it to generate trajectories that maximize verifier scores even under imperfect inputs. This RL objective serves as the primary mechanism for bridging the distribution shift by encouraging robustness to prediction noise. We will expand the method section with an explicit discussion of this generalization process and any supporting analysis in the revision.

revision: yes

Referee: [Abstract] Abstract: The weighting between t_IoU and mv_IoU in the verifier is not specified as external to the optimization; this makes the RL stage circular with respect to the supervision quality and prevents independent assessment of whether the verifier provides reliable gradients.

Authors: The linear combination t_IoU + mv_IoU uses fixed external hyperparameters (equal weights of 0.5 each, selected via a small held-out validation set prior to RL training) that remain constant throughout optimization and are not updated as part of the policy gradient process. The verifier thus provides an independent reward signal decoupled from the policy parameters. We will state the exact weighting, its selection procedure, and confirmation that it is held fixed in the revised manuscript to enable independent evaluation.

revision: yes

Referee: [Abstract] Abstract: No quantitative results, ablation studies, error analysis, or specific metrics (e.g., t_IoU values, FPS comparisons, or baseline numbers) are supplied to support the claim of a 'strong trade-off,' rendering the central empirical claim unverifiable from the manuscript text.

Authors: The abstract is intentionally concise, while the full manuscript reports quantitative results including t_IoU and mv_IoU scores, FPS efficiency measurements, and comparisons to frame-by-frame baselines across multiple settings, along with ablations on the tracking and RL components. To make the central claim directly verifiable from the abstract, we will incorporate a concise statement of key metrics (e.g., relative gains in localization quality at reduced FPS) in the revised abstract.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains independent of its inputs


The provided abstract and description outline a pipeline that synthesizes CoT trajectories, substitutes ground-truth annotations for supervision, and applies RL with a t_IoU + mv_IoU verifier. None of these steps matches the enumerated circularity patterns: there is no self-definitional equivalence (e.g., a quantity defined in terms of its own output), no fitted parameter renamed as a prediction, no load-bearing self-citation, no imported uniqueness theorem, no smuggled ansatz, and no renaming of a known result. The ground-truth substitution is an explicit training choice to reduce noise, not a construction that forces the reported trade-off. The efficiency-quality claims are presented as experimental outcomes across FPS settings rather than tautological derivations. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the effectiveness of ground-truth substitution for supervision and the reliability of the custom t_IoU + mv_IoU verifier; no explicit free parameters, axioms, or invented entities are stated.


