Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

Baoxing Huai; Nicholas Jing Yuan; Zhijie Lin; Zhou Zhao; Zhu Zhang

arxiv: 2008.06941 · v2 · pith:63KKYZTInew · submitted 2020-08-16 · 💻 cs.CV · cs.MM

Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan

classification 💻 cs.CV cs.MM

keywords object · relation · grounding · multi-branch · object-aware · spatio-temporal · branch · video

Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to a given sentence. Most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences. This challenging task requires capturing critical object relations to identify the queried target. However, existing approaches cannot distinguish notable objects and are left with ineffective relation modeling among unnecessary objects. Thus, we propose a novel object-aware multi-branch relation network for object-aware relation discovery. Concretely, we first devise multiple branches for object-aware region modeling, where each branch focuses on a crucial object mentioned in the sentence. We then propose multi-branch relation reasoning to capture critical object relationships between the main branch and the auxiliary branches. Moreover, we apply a diversity loss so that each branch attends only to its corresponding object, which boosts multi-branch learning. Extensive experiments show the effectiveness of our proposed method.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding
cs.CV 2026-06 unverdicted novelty 7.0

Null-Space Tuning injects learnable residuals into input features confined to the null-space for high-quality inputs to preserve pre-trained knowledge while directing restoration components for low-quality inputs outs...

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
cs.CV 2026-04 unverdicted novelty 7.0

Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing
cs.CV 2026-04 unverdicted novelty 6.0

A new large-scale subjective database for video portrait region cropping with temporal smoothing, benchmarked using existing models and compared to saliency predictions.