OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
Pith reviewed 2026-05-07 16:47 UTC · model grok-4.3
The pith
A large-scale open-world video dataset built via iterative concept expansion and a self-correction CoT training paradigm let MLLMs ground rare concepts and reach zero-shot SOTA on existing VTG benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniVTG supplies a large-scale open-world VTG dataset constructed through a Semantic Coverage Iterative Expansion pipeline, which identifies vocabulary gaps in existing datasets and collects matching videos, annotated by a caption-centric engine that prompts MLLMs for dense timestamped descriptions. Paired with the dataset, a Self-Correction CoT training paradigm proceeds in three stages (SFT, CoT fine-tuning, and reinforcement learning) so that the model first predicts a grounding, then uses its stronger video-understanding capacity to reflect on and refine its own output, yielding strong open-world performance and zero-shot gains on prior benchmarks.
What carries the argument
Self-Correction Chain-of-Thought (CoT) training paradigm, in which the MLLM generates an initial grounding prediction, reflects on it using its video-understanding strengths, and revises the timestamps before final output.
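The predict-reflect-revise loop can be sketched as follows. This is a minimal illustration, not the paper's API: the function names, the fixed-point stopping rule, and the toy stand-ins for the MLLM calls are all assumptions.

```python
# Hypothetical sketch of the Self-Correction CoT loop. `predict_fn` and
# `reflect_fn` stand in for two MLLM calls: an initial grounding pass and
# a reflection pass that critiques and revises the predicted span.

def self_correct(query, video, predict_fn, reflect_fn, rounds=1):
    """Predict an initial (start, end) span, then let the model's stronger
    understanding ability revise it; stop early if the span stabilizes."""
    span = predict_fn(query, video)               # initial grounding guess
    for _ in range(rounds):
        revised = reflect_fn(query, video, span)  # reflect on own output
        if revised == span:                       # converged: keep the span
            break
        span = revised
    return span

# Toy stand-ins: the "predictor" is biased; the "reflector" nudges the
# span halfway toward a known ground-truth segment on each round.
def toy_predict(query, video):
    return (0.0, 5.0)

def toy_reflect(query, video, span):
    target = (2.0, 6.0)
    s, e = span
    return (round(s + 0.5 * (target[0] - s), 2),
            round(e + 0.5 * (target[1] - e), 2))

print(self_correct("a rare concept", None, toy_predict, toy_reflect, rounds=3))
# → (1.75, 5.88)
```

The same loop could in principle run at inference time as well as during training, which is the extrapolation the "reading between the lines" section makes below.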
If this is right
- The rare-common performance gap narrows because the model learns to leverage its stronger understanding ability to correct grounding errors.
- Zero-shot transfer improves on four existing VTG benchmarks because the training instills generalizable reflection rather than dataset-specific patterns.
- The caption-centric annotation engine produces higher-quality labels for rare concepts than direct grounding prompts would have achieved.
- The three-stage pipeline (SFT followed by CoT fine-tuning followed by RL) can be applied to other MLLM video tasks where understanding exceeds direct prediction.
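If the RL stage rewards temporal overlap between the refined span and the ground truth, it might use standard temporal IoU. The abstract does not specify the reward, so treat this as an assumed but conventional choice for RL-tuned grounding models:

```python
# Temporal IoU between a predicted and a ground-truth (start, end) span,
# a common scalar reward for RL fine-tuning of grounding models. Whether
# OmniVTG's RL stage uses exactly this signal is an assumption.

def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 6.0), (3.0, 7.0)))  # overlap 3s, union 5s → 0.6
```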
Where Pith is reading between the lines
- The iterative expansion method could be reused to grow datasets for other open-world video tasks such as action detection or video question answering.
- If the self-correction loop is applied at inference time rather than only during training, further gains on rare concepts may appear without additional labeled data.
- The same caption-then-reflect strategy might reduce annotation cost in any domain where MLLMs already excel at description but not at structured output.
Load-bearing premise
Modern multimodal large language models are reliably better at producing dense timestamped video captions than at directly outputting accurate grounding timestamps.
What would settle it
Training an MLLM on the same OmniVTG data with ordinary supervised fine-tuning only, without the CoT reflection stage or reinforcement learning, leaves a large performance gap between rare and common concepts on held-out OmniVTG queries.
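That decisive experiment reduces to a single number: the gap in recall between rare- and common-concept queries under SFT-only training. A self-contained sketch of the metric, with an illustrative 0.5 IoU threshold and made-up sample data:

```python
# Rare-vs-common performance gap at a fixed temporal-IoU threshold.
# Field names ("pred", "gt") and the 0.5 threshold are illustrative.

def iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(samples, thr=0.5):
    """Fraction of queries whose predicted span hits ground truth at IoU >= thr."""
    return sum(iou(s["pred"], s["gt"]) >= thr for s in samples) / len(samples)

def rare_common_gap(rare, common, thr=0.5):
    return recall_at_iou(common, thr) - recall_at_iou(rare, thr)

common = [{"pred": (1, 5), "gt": (1, 5)}, {"pred": (0, 4), "gt": (1, 5)}]
rare = [{"pred": (0, 2), "gt": (6, 9)}, {"pred": (5, 9), "gt": (6, 9)}]
print(rare_common_gap(rare, common))  # → 0.5; a positive gap means rare concepts lag
```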
Original abstract
Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at https://github.com/oceanflowlab/OmniVTG.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniVTG, a large-scale open-world video temporal grounding dataset constructed via Semantic Coverage Iterative Expansion to address vocabulary gaps in prior datasets, paired with a caption-centric data engine that prompts MLLMs to produce dense timestamped descriptions. It further proposes a Self-Correction Chain-of-Thought training paradigm (SFT followed by CoT finetuning and RL) to leverage MLLMs' video understanding strengths for improved grounding, claiming superior open-world performance on OmniVTG and SOTA zero-shot results on four existing VTG benchmarks.
Significance. If the central claims hold after validation, the work would be significant for scaling VTG to rare concepts and providing a reproducible training recipe for MLLMs; the public code release and focus on data diversity are clear strengths that could enable follow-on research in open-world video understanding.
Major comments (2)
- [§3.2] §3.2 (Caption-Centric Data Engine): The core assumption that modern MLLMs excel at dense captioning over direct grounding is used to justify the annotation pipeline for rare concepts, yet the manuscript reports no quantitative validation such as human agreement scores, timestamp localization error distributions, or an ablation comparing caption-derived vs. direct-grounding annotations on a held-out rare-concept subset. This directly affects the reliability of the OmniVTG training data and the downstream SOTA zero-shot transfer claims.
- [§4] §4 (Experiments): The claim of SOTA zero-shot performance on four external VTG benchmarks and superiority on rare concepts is load-bearing, but the section provides insufficient detail on baseline implementations, per-concept breakdowns (rare vs. common), and ablations isolating the three-stage CoT pipeline; without these, it is difficult to attribute gains to the proposed method rather than dataset scale alone.
Minor comments (2)
- [§3.3] The notation for the three-stage training pipeline (SFT, CoT, RL) could be clarified with an explicit diagram or equation showing how the self-correction loss is formulated.
- [Figure 3] Figure 3 (dataset statistics) would benefit from an additional panel showing the distribution of rare concepts across video lengths to support the open-world claim.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We address each major comment point-by-point below, acknowledging where the original manuscript was insufficient and outlining the revisions we will make.
Point-by-point responses
Referee: [§3.2] §3.2 (Caption-Centric Data Engine): The core assumption that modern MLLMs excel at dense captioning over direct grounding is used to justify the annotation pipeline for rare concepts, yet the manuscript reports no quantitative validation such as human agreement scores, timestamp localization error distributions, or an ablation comparing caption-derived vs. direct-grounding annotations on a held-out rare-concept subset. This directly affects the reliability of the OmniVTG training data and the downstream SOTA zero-shot transfer claims.
Authors: We appreciate the referee highlighting this validation gap. The assumption originated from our internal development observations that MLLMs produced more reliable dense, timestamped captions than direct grounding outputs for rare concepts. However, we agree that the manuscript should have included explicit quantitative support. In the revised version, we will add a dedicated subsection (and appendix) reporting a human evaluation on a held-out subset of annotations, including agreement scores and timestamp error distributions. We will also include an ablation directly comparing caption-centric annotations against direct MLLM grounding on rare-concept samples to demonstrate the pipeline's reliability and its contribution to the zero-shot results. revision: yes
Referee: [§4] §4 (Experiments): The claim of SOTA zero-shot performance on four external VTG benchmarks and superiority on rare concepts is load-bearing, but the section provides insufficient detail on baseline implementations, per-concept breakdowns (rare vs. common), and ablations isolating the three-stage CoT pipeline; without these, it is difficult to attribute gains to the proposed method rather than dataset scale alone.
Authors: We agree that the experimental section requires greater transparency to substantiate the claims. The original manuscript followed common reproduction practices but did not provide sufficient granularity. In the revision, we will substantially expand Section 4 with: detailed baseline implementation descriptions and hyperparameter settings; per-concept breakdowns separating rare and common concepts on all four benchmarks; and ablations that isolate each stage of the three-stage Self-Correction CoT pipeline (SFT, CoT finetuning, and RL). These additions will clarify the source of gains beyond dataset scale and include statistical analysis where appropriate. revision: yes
Circularity Check
No significant circularity; dataset construction and training paradigm are independently motivated and externally evaluated
Full rationale
The paper's derivation chain consists of an empirical observation about MLLM strengths in dense captioning (used to motivate a caption-centric annotation pipeline for the new OmniVTG dataset), followed by a three-stage training procedure (SFT, CoT finetuning, RL) to instill self-correction. These steps produce a new dataset and model, with performance claims resting on direct evaluation against the OmniVTG test set and zero-shot transfer to four independent external VTG benchmarks. No equations, fitted parameters, or self-citations reduce the central claims to self-definition or construction by fiat. The motivating assumption about MLLM captioning vs. grounding is presented as an observed empirical fact rather than a derived result, and the external benchmarks supply independent validation. The approach is therefore self-contained with no load-bearing circular reductions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: modern MLLMs excel at dense captioning more than at direct grounding.