Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding

Antoni B. Chan; Guorong Li; Laiyun Qing; Qingming Huang; Ruixin Li; Xinyan Liu; Zelin Zheng

arxiv: 2605.21973 · v1 · pith:2QPMY645new · submitted 2026-05-21 · 💻 cs.CV

Zelin Zheng, Xinyan Liu, Ruixin Li, Antoni B. Chan, Guorong Li, Qingming Huang, Laiyun Qing

Pith reviewed 2026-05-22 08:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords video temporal grounding · Video-LLM · temporal perception · evidence-driven reasoning · event boundary prediction · verifiable reasoning · candidate segment pool

The pith

Foresee-to-Ground stabilizes video temporal grounding by building a video-wide pool of candidate segments that the LLM cites as evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current Video-LLM methods generate timestamps directly from unstructured visual tokens, which often produces brittle and inconsistent boundaries for events in video. The paper introduces Foresee-to-Ground to treat the task instead as first identifying candidate events and then measuring their boundaries. It does this by learning temporal representations that form a video-wide pool of possible segments and presenting those segments to the language model as explicit evidence units it can cite. A sympathetic reader would care because the separation makes results more stable, verifiable, and transferable while keeping the model's broader video skills intact.

Core claim

F2G reformulates video temporal grounding as a verifiable Identify-then-Measure problem. It integrates Predictive Temporal Perception with Evidence-Driven Reasoning by learning boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable.

What carries the argument

The video-wide evidence pool of candidate event segments, built from boundary-sensitive temporal representations and presented to the LLM as citable evidence units.
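The evidence-pool idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the `Segment` fields, the plain Top-K selection by score, the `[E1]`-style tag format, and the example answer string are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    score: float  # proposal confidence (hypothetical field)


def build_evidence_pool(candidates, k):
    """Keep the k highest-scoring candidates, best first."""
    return sorted(candidates, key=lambda s: s.score, reverse=True)[:k]


def serialize_pool(pool):
    """Render each segment as a numbered, citable line (tag format is an assumption)."""
    return "\n".join(
        f"[E{i}] {seg.start:.1f}s - {seg.end:.1f}s"
        for i, seg in enumerate(pool, start=1)
    )


def parse_citation(answer, pool):
    """Identify step: resolve a cited tag such as "[E2]" to its segment.

    Returns (index, segment), or None when the answer cites nothing valid.
    Being able to reject an invalid citation is what makes the output checkable.
    """
    match = re.search(r"\[E(\d+)\]", answer)
    if match is None:
        return None
    index = int(match.group(1))
    if not 1 <= index <= len(pool):
        return None
    return index, pool[index - 1]


candidates = [
    Segment(2.0, 9.5, 0.41),
    Segment(12.0, 18.5, 0.93),
    Segment(30.0, 41.0, 0.77),
    Segment(44.0, 46.0, 0.12),
]
pool = build_evidence_pool(candidates, k=3)
prompt_block = serialize_pool(pool)
# Hypothetical model answer: cite a unit first, then report a measured span.
cited = parse_citation("The event matches [E2]; refined span: 30.4s - 40.2s", pool)
```

In this toy setup the measured span can be compared against the cited unit, so a boundary that drifts far from every pooled segment is easy to flag for review.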

If this is right

Grounding accuracy improves consistently across diverse benchmarks.
The framework transfers robustly across different Video-LLM backbones.
General video understanding capabilities of the base model are preserved.
Predictions become verifiable because the LLM can cite specific segments from the evidence pool.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same identify-then-measure split could reduce fabricated timestamps in other long-form video reasoning tasks.
Making segments citable opens the door to human review or correction of the model's evidence before final boundary output.
The approach may scale better to videos with many overlapping events by keeping hypotheses explicit rather than implicit in token streams.
Similar evidence-pool mechanisms could be tested in audio or multi-modal temporal grounding settings.

Load-bearing premise

That building a video-wide pool of candidate event segments from boundary-sensitive representations and exposing them as citable units will bind the LLM's boundary predictions to explicit event hypotheses and stabilize results.

What would settle it

An ablation that removes the evidence pool while keeping all other components and finds no measurable drop in boundary precision or consistency on standard VTG benchmarks would falsify the central claim.
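A comparison of this kind needs a fixed scoring routine. The sketch below computes temporal IoU, mean IoU, and Recall@1 at IoU thresholds, which are metrics commonly reported for temporal grounding; that the paper uses exactly these, and the threshold values chosen, are assumptions. The prediction lists are made-up numbers, not results from the paper.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_report(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Mean IoU plus the fraction of queries whose IoU reaches each threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    report = {"mIoU": sum(ious) / n}
    for t in thresholds:
        report[f"R1@{t}"] = sum(iou >= t for iou in ious) / n
    return report


# Illustrative values only: one ground-truth span per query.
gts = [(10, 20), (0, 8), (30, 40)]
with_pool = [(10, 20), (2, 8), (30, 40)]     # hypothetical full-model outputs
without_pool = [(12, 24), (0, 3), (50, 60)]  # hypothetical ablated outputs

full = grounding_report(with_pool, gts)
ablated = grounding_report(without_pool, gts)
```

Running the same report on both variants, with everything else held fixed, is the comparison that would show whether removing the pool changes boundary quality.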

Figures

Figures reproduced from arXiv: 2605.21973 by Antoni B. Chan, Guorong Li, Laiyun Qing, Qingming Huang, Ruixin Li, Xinyan Liu, Zelin Zheng.

**Figure 1.** Paradigms for VTG with Video-LLMs. (a) Direct timestamp generation; (b) Direct timestamp generation with timetext interface; (c) F2G: a verifiable grounding pipeline.

**Figure 2.** Foresee-to-Ground framework overview. Left: model architecture and LLM input sequence construction with evidence units. Right: three-stage training pipeline (Stage-1: predictive temporal perception pretraining; Stage-2: proposal warm-up; Stage-3: evidence-driven Identify-then-Measure fine-tuning via LoRA on the LLM).

**Figure 4.** Repeated-inference stability on ActivityNet-Captions. For visualization, we omit |∆IoU|,IoU ∈ [0, 0.02) and renormalize the remaining density to highlight tail behavior.

**Figure 3.** Analysis on special cases.

**Figure 5.** PCA visualization of temporal representations. w/o TM: without TM, uses temporally pooled visual features; w/ TM: with TM, uses temporal module output features. Time colors points by temporal order, and Span colors points by ground-truth membership (orange: inside; blue: outside). The prediction-trained temporal module makes the embedding trajectory more temporally coherent and increases inside/outside separab…

**Figure 6.** Stage-wise diagnostics for evidence-driven grounding.

**Figure 7.** Detailed data flow of additional modules across the three training stages.

**Figure 8.** Visualization of the Top-K proposal pool. For each video pair, we plot the Top-K candidate segments Tk (in seconds), ordered by proposal objectness (top to bottom). The resulting pool provides diverse, video-wide event hypotheses that are later serialized as evidence units for Identify→Measure grounding.

**Figure 9.** Repeated-inference failure decomposition on ActivityNet-Captions. We decompose repeated decoding outcomes into consistent-miss (IoU1 = IoU2 = 0) and stochastic-collapse (exactly one run has IoU = 0), and further stratify collapse cases by the other run's non-zero IoU.

**Figure 10.** Citation-gap distribution. We plot ∆IoU = IoUbest − IoUcite, where IoUbest is the best candidate IoU within the Top-K proposal pool and IoUcite is the IoU of the model-cited span. Most mass lies near ∆IoU = 0 (87.8% / 93.6% of queries have ∆IoU < 0.10 on ActivityNet-Captions / Charades-STA), indicating that citation is usually near-optimal given the pool; remaining failures are therefore more often constr…
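The two diagnostics defined in the captions for Figures 9 and 10 are simple enough to sketch directly. The code below follows the stated definitions (∆IoU = IoUbest − IoUcite; consistent-miss when both runs score zero; stochastic-collapse when exactly one does). The function names, the `both-nonzero` label, the 0-based citation index, and the sample numbers are assumptions for illustration and are not taken from the paper.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def citation_gap(pool, cited_index, gt):
    """IoU_best - IoU_cite for one query; cited_index is 0-based."""
    ious = [temporal_iou(seg, gt) for seg in pool]
    return max(ious) - ious[cited_index]


def share_below(gaps, threshold=0.10):
    """Fraction of queries whose citation gap is under the threshold."""
    return sum(g < threshold for g in gaps) / len(gaps)


def classify_repeat(iou_run1, iou_run2):
    """Label one pair of repeated decodes using the Figure 9 definitions."""
    zeros = (iou_run1 == 0) + (iou_run2 == 0)
    if zeros == 2:
        return "consistent-miss"
    if zeros == 1:
        return "stochastic-collapse"
    return "both-nonzero"  # hypothetical label for the remaining case


# Illustrative pool and ground truth, in seconds.
pool = [(12.0, 18.5), (30.0, 41.0), (2.0, 9.5)]
gt = (31.0, 40.0)

gap_good = citation_gap(pool, cited_index=1, gt=gt)  # cites the best candidate
gap_bad = citation_gap(pool, cited_index=0, gt=gt)   # cites a non-overlapping one
near_optimal = share_below([gap_good, gap_bad])
```

A gap near zero means the cited unit was already the best one in the pool, so any remaining error comes from the pool itself rather than from the citation step.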

Abstract

Current Video-LLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often leading to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

F2G reformulates video temporal grounding as an identify-then-measure task by building a pool of boundary-sensitive candidate segments that the LLM treats as citable evidence.

The main thing to know is that this paper splits video temporal grounding into first spotting candidate events via learned boundary-sensitive features and then letting the LLM measure boundaries against those explicit segments. This replaces the usual direct timestamp output from raw visual tokens, which the authors say produces brittle and inconsistent results. The framework is called Foresee-to-Ground, and the central move is turning the video into a pool of verifiable event units that the model can reference during reasoning. They report that this leads to steadier predictions and works across different Video-LLM backbones without hurting general video tasks. The experiments are described as showing consistent gains on several benchmarks.

What the paper does well is give a clear motivation for the change and a straightforward way to make the grounding steps more inspectable. The evidence pool idea is a practical engineering step that could help in settings where you need to trace why a particular time range was chosen. The decoupling of identification from measurement follows logically from the stated problem of numeric instability.

On the soft spots, the description stays fairly high-level on exactly how the perception module learns the boundary-sensitive representations or how the segments are tokenized for the LLM. If the gains depend heavily on extra training or specific hyper-parameters rather than the structure itself, that would reduce the impact. The abstract-level claims of robust transfer are plausible but would need the full tables and ablations to judge how large or reliable the improvements really are.

This work is aimed at people building or fine-tuning Video-LLMs for temporal tasks like event localization. Readers who want incremental but concrete fixes to existing grounding pipelines will get the most out of the framework and the backbone-transfer results.

It deserves a serious referee because the reformulation is testable and the reported outcomes are specific enough to check against baselines. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Foresee-to-Ground (F2G), a framework that reformulates Video Temporal Grounding (VTG) as a verifiable Identify-then-Measure problem. It integrates Predictive Temporal Perception, which learns boundary-sensitive temporal representations to construct a video-wide evidence pool of candidate event segments, with Evidence-Driven Reasoning in which the LLM operates over these segments as citable evidence units. The central idea is that decoupling event identification from precise boundary measurement stabilizes grounding predictions and renders them verifiable. The paper reports that F2G yields consistent accuracy gains across diverse benchmarks, transfers robustly to different Video-LLM backbones, and preserves general video-understanding performance.

Significance. If the reported gains and transfer results are reproducible, the work would offer a meaningful conceptual and practical advance for VTG. The explicit separation of identification from measurement, together with the use of discrete, citable evidence units, directly targets the brittleness of unstructured timestamp generation. The claimed backbone-agnostic transfer and preservation of general capabilities would make the approach attractive for deployment in existing Video-LLM pipelines.

major comments (2)

[§4] §4 (Experiments): the abstract and experimental claims assert consistent improvements and robust transfer, yet the provided description supplies no quantitative tables, baseline comparisons, error bars, or ablation results that isolate the contribution of the evidence-pool construction. Without these data it is impossible to assess whether the Identify-then-Measure reformulation is the load-bearing factor behind the reported gains.
[§3.1] §3.1 (Predictive Temporal Perception): the construction of the video-wide evidence pool from boundary-sensitive representations is described at a high level, but the precise loss functions, segment-sampling procedure, and mechanism that guarantees the segments are “citable evidence units” are not formalized. This leaves the verifiability claim under-specified and difficult to reproduce.

minor comments (2)

[§3.2] Clarify the exact interface between the perception module output and the LLM input tokens; a short pseudocode block or diagram annotation would remove ambiguity.
[Notation and Figures] Ensure that all newly introduced acronyms (F2G, VTG) are expanded on first use in the main text and that figure captions explicitly label the evidence pool and citable segments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity and rigor, particularly regarding experimental evidence and formalization of key components. We address each point below and will incorporate revisions to strengthen the manuscript.

Referee: [§4] §4 (Experiments): the abstract and experimental claims assert consistent improvements and robust transfer, yet the provided description supplies no quantitative tables, baseline comparisons, error bars, or ablation results that isolate the contribution of the evidence-pool construction. Without these data it is impossible to assess whether the Identify-then-Measure reformulation is the load-bearing factor behind the reported gains.

Authors: We acknowledge that the experimental presentation in the reviewed version may have lacked sufficient detail in the excerpt provided. The full manuscript's Section 4 includes quantitative tables reporting consistent accuracy gains on benchmarks such as ActivityNet Captions, Charades-STA, and others, along with comparisons to prior Video-LLM baselines. To directly isolate the contribution of the evidence-pool construction and the Identify-then-Measure reformulation, we will add error bars, expanded baseline tables, and dedicated ablation studies in the revision. These additions will make the load-bearing role of the proposed decoupling explicit and reproducible.

Revision: yes

Referee: [§3.1] §3.1 (Predictive Temporal Perception): the construction of the video-wide evidence pool from boundary-sensitive representations is described at a high level, but the precise loss functions, segment-sampling procedure, and mechanism that guarantees the segments are “citable evidence units” are not formalized. This leaves the verifiability claim under-specified and difficult to reproduce.

Authors: We agree that greater formalization is required for reproducibility and to substantiate the verifiability claim. In the revised manuscript we will explicitly state the loss functions employed to learn boundary-sensitive temporal representations, provide the precise segment-sampling procedure used to populate the video-wide evidence pool, and formalize the mechanism by which segments become citable evidence units: specifically, how they are tokenized, referenced, and bound to event hypotheses within the LLM's reasoning trace.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper's core contribution is a methodological reformulation of Video Temporal Grounding into an Identify-then-Measure process that populates an evidence pool of candidate segments via learned boundary-sensitive representations and then exposes them to the LLM for reasoning. This structure is presented as an explicit design choice motivated by observed brittleness in direct timestamp generation, with claims of stabilization and verifiability following directly from the exposure of discrete citable units rather than raw tokens. No equations, parameters, or uniqueness theorems are shown to reduce to fitted inputs or self-citations by construction; the empirical results on benchmarks and backbone transfer are reported as external validation. The derivation chain remains self-contained against the stated assumptions without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete and based on high-level claims; full details on any fitted parameters or background assumptions are absent.

axioms (1)

Domain assumption: Decoupling identification from boundary measurement via an evidence pool will stabilize LLM-based grounding predictions.
This premise underpins the entire Identify-then-Measure reformulation described in the abstract.


[1] [1]

CVPR , month =

Li, Hongyu and Chen, Jinyu and Wei, Ziyu and Huang, Shaofei and Hui, Tianrui and Gao, Jialin and Wei, Xiaoming and Liu, Si , title =. CVPR , month =. 2025 , pages =

work page 2025

[2] [2]

CVPR , pages=

Interventional video grounding with dual contrastive learning , author=. CVPR , pages=

work page

[3] [3]

Li, Yan and Ji, Bin and Shi, Xintian and Zhang, Jianguo and Kang, Bin and Wang, Limin , booktitle=

work page

[4] [4]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Temporal segment networks for action recognition in videos , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

work page

[5] [5]

ICCV , pages=

Temporal action detection with structured segment networks , author=. ICCV , pages=

work page

[6] [6]

A Survey on Video Temporal Grounding With Multimodal Large Language Model , year=

Wu, Jianlong and Liu, Wei and Liu, Ye and Liu, Meng and Nie, Liqiang and Lin, Zhouchen and Chen, Chang Wen , journal=. A Survey on Video Temporal Grounding With Multimodal Large Language Model , year=

work page

[7] [7]

Temporal Sentence Grounding in Videos: A Survey and Future Directions , year=

Zhang, Hao and Sun, Aixin and Jing, Wei and Zhou, Joey Tianyi , journal=. Temporal Sentence Grounding in Videos: A Survey and Future Directions , year=

work page

[8] [8]

CVPR , pages=

System-status-aware adaptive network for online streaming video understanding , author=. CVPR , pages=

work page

[9] [9]

ICCV , pages=

Fast video moment retrieval , author=. ICCV , pages=

work page

[10] [10]

AAAI , volume =

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition , author =. AAAI , volume =

work page

[11] [11]

Wang, Haibo and Xu, Zhiyang and Cheng, Yu and Diao, Shizhe and Zhou, Yufan and Cao, Yixin and Wang, Qifan and Ge, Weifeng and Huang, Lifu , booktitle =

work page

[12] [12]

ECCV , pages=

Meta spatio-temporal debiasing for video scene graph generation , author=. ECCV , pages=

work page

[13] [13]

CVPR , pages=

Hierarchical video-moment retrieval and step-captioning , author=. CVPR , pages=

work page

[14] [14]

An Empirical Study of

Cao, Min and Bai, Yang and Zeng, Ziyin and Ye, Mang and Zhang, Min , booktitle=. An Empirical Study of

work page

[15] [15]

CVPR , pages=

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval , author=. CVPR , pages=

work page

[16] [16]

Yang, Antoine and Nagrani, Arsha and Seo, Paul Hongsuck and Miech, Antoine and Pont-Tuset, Jordi and Laptev, Ivan and Sivic, Josef and Schmid, Cordelia , booktitle=

work page

[17] [17]

2024 , booktitle =

Long Qian and Juncheng Li and Yu Wu and Yaobo Ye and Hao Fei and Tat-Seng Chua and Yueting Zhuang and Siliang Tang , title =. 2024 , booktitle =

work page 2024

[18] [18]

2025 , booktitle =

Yongxin Guo and Jingyu Liu and Mingda Li and Qingbin Liu and Xi Chen and Xiaoying Tang , title =. 2025 , booktitle =

work page 2025

[19] [19]

2025 , pages =

Yongliang Wu and Xinting Hu and Yuyang Sun and Yizhou Zhou and Wenbo Zhu and Fengyun Rao and Bernt Schiele and Xu Yang , title =. 2025 , pages =

work page 2025

[20] [20]

2024 , pages =

Shuhuai Ren and Linli Yao and Shicheng Li and Xu Sun and Lu Hou , title =. 2024 , pages =

work page 2024

[21] [21]

2024 , pages =

Bin Huang and Xin Wang and Hong Chen and Zihan Song and Wenwu Zhu , title =. 2024 , pages =

work page 2024

[22] Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, and Limin Wang. 2025.

[23] Yangliu Hu, Zikai Song, Na Feng, Yawei Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. 2025.

[24] Ming Nie, Dan Ding, Chunwei Wang, Yuanfan Guo, Jianhua Han, Hang Xu, and Li Zhang. 2024.

[25] GroundingGPT: Language enhanced multi-modal grounding model. ACL.

[26] LITA: Language Instructed Temporal-Localization Assistant. ECCV.

[27] HawkEye: Training video-text LLMs for grounding text in videos. arXiv preprint arXiv:2403.10228.

[28] Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, and Kevin Zhao. 2025.

[29] Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, and Qin Jin.

[30] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs. arXiv preprint arXiv:2512.14698.

[31] arXiv preprint arXiv:2506.03569.

[32] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. CVPR.

[33] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. SpeedNet: Learning the Speediness in Videos.

[34] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. CVPR.

[35] Lianghua Huang, Yu Liu, Bin Wang, Pan Pan, Yinghui Xu, and Rong Jin. Self-supervised Video Representation Learning by Context and Motion Decoupling.

[36] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. CVPR.

[37] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. 2022.

[38] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas.

[39] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco ... V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning.

[40] Randall Balestriero and Yann LeCun. LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. arXiv preprint arXiv:2511.08544.

[41] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. 2025.

[42] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017.

[43] Dense-captioning events in videos. ICCV.

[44] Detecting Moments and Highlights in Videos via Natural Language Queries. NeurIPS.

[45] Weakly supervised video moment retrieval from text queries. CVPR.

[46] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xue... Qwen3-VL Technical Report.

[47] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Cheng, Z... Qwen2.5-VL Technical Report.

[48] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li.

[49] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2025.

[50] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning.

[51] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191.

[52] Bridge the Modality and Capability Gaps in Vision-Language Model Selection. NeurIPS.

[53] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. 2025.

[54] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo...

[55] Diederik P. Kingma and Jimmy Ba.