
arxiv: 2605.21973 · v1 · pith:2QPMY645new · submitted 2026-05-21 · 💻 cs.CV

Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding

Pith reviewed 2026-05-22 08:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords video temporal grounding · Video-LLM · temporal perception · evidence-driven reasoning · event boundary prediction · verifiable reasoning · candidate segment pool

The pith

Foresee-to-Ground stabilizes video temporal grounding by building a video-wide evidence pool of candidate segments that the LLM can cite.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current Video-LLM methods generate timestamps directly from unstructured visual tokens, which often produces brittle and inconsistent boundaries for events in video. The paper introduces Foresee-to-Ground to treat the task instead as first identifying candidate events and then measuring their boundaries. It does this by learning temporal representations that form a video-wide pool of possible segments and presenting those segments to the language model as explicit evidence units it can cite. A sympathetic reader would care because the separation makes results more stable, verifiable, and transferable while keeping the model's broader video skills intact.

Core claim

F2G reformulates video temporal grounding as a verifiable Identify-then-Measure problem. It integrates Predictive Temporal Perception with Evidence-Driven Reasoning by learning boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable.

What carries the argument

The video-wide evidence pool of candidate event segments, built from boundary-sensitive temporal representations and presented to the LLM as citable evidence units.
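A minimal sketch of how such a pool might be assembled is shown below. It assumes a proposal head that emits one confidence score and one regressed span per timestep, and it keeps the Top-K spans by score. The overlap-suppression step, the `Segment` type, and the default values of `k` and `nms_iou` are illustrative assumptions, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    score: float  # proposal confidence (hypothetical "eventness" score)


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def build_evidence_pool(
    scores: Sequence[float],
    spans: Sequence[Tuple[float, float]],
    k: int = 5,
    nms_iou: float = 0.7,
) -> List[Segment]:
    """Keep the k highest-scoring spans, skipping near-duplicates.

    The suppression rule is an assumption added for illustration; it keeps
    the pool diverse so that it covers the whole video.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pool: List[Segment] = []
    for i in order:
        start, end = spans[i]
        if end <= start:
            continue  # discard degenerate spans
        if any(temporal_iou((start, end), (s.start, s.end)) > nms_iou for s in pool):
            continue  # too similar to a segment already in the pool
        pool.append(Segment(start, end, scores[i]))
        if len(pool) == k:
            break
    return pool
```

The returned list is ordered by descending score, so a segment's position in the list can serve as its citation index.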

If this is right

  • Grounding accuracy improves consistently across diverse benchmarks.
  • The framework transfers robustly across different Video-LLM backbones.
  • General video understanding capabilities of the base model are preserved.
  • Predictions become verifiable because the LLM can cite specific segments from the evidence pool.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same identify-then-measure split could reduce fabricated timestamps in other long-form video reasoning tasks.
  • Making segments citable opens the door to human review or correction of the model's evidence before final boundary output.
  • The approach may scale better to videos with many overlapping events by keeping hypotheses explicit rather than implicit in token streams.
  • Similar evidence-pool mechanisms could be tested in audio or multi-modal temporal grounding settings.

Load-bearing premise

That building a video-wide pool of candidate event segments from boundary-sensitive representations and exposing them as citable units will bind the LLM's boundary predictions to explicit event hypotheses and stabilize results.

What would settle it

An ablation that removes the evidence pool while keeping all other components and finds no measurable drop in boundary precision or consistency on standard VTG benchmarks would falsify the central claim.
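Such an ablation would typically be scored with temporal IoU and recall at fixed IoU thresholds. The sketch below shows those two metrics under the assumption of one predicted span per query; the function names are hypothetical, and no numbers from the paper are reproduced here.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of a predicted and a ground-truth interval."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Span], gts: Sequence[Span], threshold: float) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Average temporal IoU over all queries."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)
```

Running both the full model and the pool-free variant through `recall_at_iou` and `mean_iou` on the same queries would give the comparison the test calls for; a negligible gap would count against the central claim.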

Figures

Figures reproduced from arXiv: 2605.21973 by Antoni B. Chan, Guorong Li, Laiyun Qing, Qingming Huang, Ruixin Li, Xinyan Liu, Zelin Zheng.

Figure 1: Paradigms for VTG with Video-LLMs. (a) Direct timestamp generation; (b) Direct timestamp generation with time-text interface; (c) F2G: a verifiable grounding pipeline.
… as the predominant backbone for VTG. However, the prevailing paradigm treats grounding as a direct timestamp regression problem. This formulation is architecturally misaligned, as it compels LLMs to map flattened visual tokens onto contin…
Figure 2: Foresee-to-Ground framework overview. Left: model architecture and LLM input sequence construction with evidence units. Right: three-stage training pipeline (Stage-1: predictive temporal perception pretraining; Stage-2: proposal warm-up; Stage-3: evidence-driven Identify-then-Measure fine-tuning via LoRA on the LLM).
… where sg(·) denotes stop-gradient. Since each local view $X_l^{(v)}$ contains only partial temp…
Figure 4: Repeated-inference stability on ActivityNet-Captions. For visualization, we omit |∆IoU|,IoU ∈ [0, 0.02) and renormalize the remaining density to highlight tail behavior.
… (Qwen3-VL + FT), indicating higher accuracy with lower variance. We provide an additional failure decomposition analysis in Appendix F.2. Efficiency Overhead. F2G adds a compact evidence interface with modest compute and context cost. Se…
Figure 3: Analysis on special cases.
Figure 5: PCA visualization of temporal representations. w/o TM: without the temporal module, using temporally pooled visual features; w/ TM: with the temporal module, using its output features. Time colors points by temporal order, and Span colors points by ground-truth membership (orange: inside; blue: outside). The prediction-trained temporal module makes the embedding trajectory more temporally coherent and increases inside/outside separab…
Figure 6: Stage-wise diagnostics for evidence-driven grounding.
Figure 7: Detailed data flow of additional modules across the three training stages.
… E Additional Results. E.1 Standalone Stage-2 Proposal Quality. To evaluate the proposal head independently of downstream LLM reasoning, we report Stage-2-only metrics before Top-K evidence serialization. The proposal head produces dense per-timestep predictions, including an eventness score and a regressed temporal span. We report Cen…
Figure 8: Visualization of the Top-K proposal pool. For each video pair, we plot the Top-K candidate segments $T_k$ (in seconds), ordered by proposal objectness (top to bottom). The resulting pool provides diverse, video-wide event hypotheses that are later serialized as evidence units for Identify→Measure grounding.
… and the IoU of the cited span, $\mathrm{IoU}_{\text{cite}} = \mathrm{IoU}(T_z, T^\star)$ (Eq. 21), where $z$ is the model-cited index. We then …
Figure 9: Repeated-inference failure decomposition on ActivityNet-Captions. We decompose repeated decoding outcomes into consistent-miss ($\mathrm{IoU}_1 = \mathrm{IoU}_2 = 0$) and stochastic-collapse (exactly one run has IoU = 0), and further stratify collapse cases by the other run's non-zero IoU.
Figure 10: Citation-gap distribution. We plot $\Delta\mathrm{IoU} = \mathrm{IoU}_{\text{best}} - \mathrm{IoU}_{\text{cite}}$, where $\mathrm{IoU}_{\text{best}}$ is the best candidate IoU within the Top-K proposal pool and $\mathrm{IoU}_{\text{cite}}$ is the IoU of the model-cited span. Most mass lies near ∆IoU = 0 (87.8% / 93.6% of queries have ∆IoU < 0.10 on ActivityNet-Captions / Charades-STA), indicating that citation is usually near-optimal given the pool; remaining failures are therefore more often constr…
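The two diagnostics described in the captions of Figures 9 and 10 can be sketched as follows. This is an illustrative reading of those captions, not the authors' evaluation code: the function names, the 0-based `cited_index`, and the `both_nonzero` bucket are assumptions introduced here.

```python
from typing import Dict, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def citation_gap(pool: Sequence[Span], cited_index: int, gt: Span) -> float:
    """Delta IoU = IoU_best - IoU_cite for one query.

    IoU_best is the best IoU any pool segment reaches against the ground
    truth; IoU_cite is the IoU of the segment the model cited.
    """
    ious = [temporal_iou(seg, gt) for seg in pool]
    return max(ious) - ious[cited_index]


def decompose_repeated_runs(
    iou_run1: Sequence[float], iou_run2: Sequence[float]
) -> Dict[str, int]:
    """Count per-query outcomes across two decoding runs of the same queries."""
    counts = {"consistent_miss": 0, "stochastic_collapse": 0, "both_nonzero": 0}
    for a, b in zip(iou_run1, iou_run2):
        if a == 0 and b == 0:
            counts["consistent_miss"] += 1      # both runs miss entirely
        elif a == 0 or b == 0:
            counts["stochastic_collapse"] += 1  # exactly one run collapses to IoU = 0
        else:
            counts["both_nonzero"] += 1
    return counts
```

A citation gap near zero means the model picked the best segment available, so any remaining error comes from the pool itself rather than from the citation step.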
Original abstract

Current Video-LLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often leading to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Foresee-to-Ground (F2G), a framework that reformulates Video Temporal Grounding (VTG) as a verifiable Identify-then-Measure problem. It integrates Predictive Temporal Perception, which learns boundary-sensitive temporal representations to construct a video-wide evidence pool of candidate event segments, with Evidence-Driven Reasoning in which the LLM operates over these segments as citable evidence units. The central idea is that decoupling event identification from precise boundary measurement stabilizes grounding predictions and renders them verifiable. The paper reports that F2G yields consistent accuracy gains across diverse benchmarks, transfers robustly to different Video-LLM backbones, and preserves general video-understanding performance.

Significance. If the reported gains and transfer results are reproducible, the work would offer a meaningful conceptual and practical advance for VTG. The explicit separation of identification from measurement, together with the use of discrete, citable evidence units, directly targets the brittleness of unstructured timestamp generation. The claimed backbone-agnostic transfer and preservation of general capabilities would make the approach attractive for deployment in existing Video-LLM pipelines.

major comments (2)
  1. [§4] §4 (Experiments): the abstract and experimental claims assert consistent improvements and robust transfer, yet the provided description supplies no quantitative tables, baseline comparisons, error bars, or ablation results that isolate the contribution of the evidence-pool construction. Without these data it is impossible to assess whether the Identify-then-Measure reformulation is the load-bearing factor behind the reported gains.
  2. [§3.1] §3.1 (Predictive Temporal Perception): the construction of the video-wide evidence pool from boundary-sensitive representations is described at a high level, but the precise loss functions, segment-sampling procedure, and mechanism that guarantees the segments are “citable evidence units” are not formalized. This leaves the verifiability claim under-specified and difficult to reproduce.
minor comments (2)
  1. [§3.2] Clarify the exact interface between the perception module output and the LLM input tokens; a short pseudocode block or diagram annotation would remove ambiguity.
  2. [Notation and Figures] Ensure that all newly introduced acronyms (F2G, VTG) are expanded on first use in the main text and that figure captions explicitly label the evidence pool and citable segments.
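For concreteness, one possible shape of the interface requested in minor comment 1 is sketched below. Everything in it is hypothetical: the `[E1]`-style tags, the text layout of each evidence unit, and the expected answer format are assumptions made for illustration, since the reviewed material does not specify the actual token format.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start, end) in seconds

_CITE = re.compile(r"\[E(\d+)\]")
_SPAN = re.compile(r"(\d+(?:\.\d+)?)\s*s?\s*(?:-|to)\s*(\d+(?:\.\d+)?)")


def serialize_evidence(pool: List[Span]) -> str:
    """Render each candidate segment as one citable line of prompt text."""
    return "\n".join(
        f"[E{i}] {start:.1f}s to {end:.1f}s"
        for i, (start, end) in enumerate(pool, start=1)
    )


def parse_answer(answer: str, pool: List[Span]) -> Optional[Tuple[int, Span]]:
    """Extract the cited evidence index (Identify) and the span (Measure).

    Returns None when the answer cites nothing or cites a unit that is not
    in the pool, so an unverifiable answer is rejected rather than guessed.
    Falls back to the cited segment when no valid refined span follows.
    """
    cite = _CITE.search(answer)
    if cite is None:
        return None
    index = int(cite.group(1))
    if not 1 <= index <= len(pool):
        return None
    cited = pool[index - 1]
    match = _SPAN.search(answer[cite.end():])
    if match is None:
        return index, cited
    span = (float(match.group(1)), float(match.group(2)))
    return index, span if span[1] > span[0] else cited
```

In this sketch, the string from `serialize_evidence` would be placed in the LLM prompt alongside the query, and `parse_answer` would check the model's reply against the pool before any boundary is reported.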

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity and rigor, particularly regarding experimental evidence and formalization of key components. We address each point below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [§4] §4 (Experiments): the abstract and experimental claims assert consistent improvements and robust transfer, yet the provided description supplies no quantitative tables, baseline comparisons, error bars, or ablation results that isolate the contribution of the evidence-pool construction. Without these data it is impossible to assess whether the Identify-then-Measure reformulation is the load-bearing factor behind the reported gains.

    Authors: We acknowledge that the experimental presentation in the reviewed version may have lacked sufficient detail in the excerpt provided. The full manuscript's Section 4 includes quantitative tables reporting consistent accuracy gains on benchmarks such as ActivityNet Captions, Charades-STA, and others, along with comparisons to prior Video-LLM baselines. To directly isolate the contribution of the evidence-pool construction and the Identify-then-Measure reformulation, we will add error bars, expanded baseline tables, and dedicated ablation studies in the revision. These additions will make the load-bearing role of the proposed decoupling explicit and reproducible. revision: yes

  2. Referee: [§3.1] §3.1 (Predictive Temporal Perception): the construction of the video-wide evidence pool from boundary-sensitive representations is described at a high level, but the precise loss functions, segment-sampling procedure, and mechanism that guarantees the segments are “citable evidence units” are not formalized. This leaves the verifiability claim under-specified and difficult to reproduce.

    Authors: We agree that greater formalization is required for reproducibility and to substantiate the verifiability claim. In the revised manuscript we will explicitly state the loss functions employed to learn boundary-sensitive temporal representations, provide the precise segment-sampling procedure used to populate the video-wide evidence pool, and formalize the mechanism by which segments become citable evidence units—specifically, how they are tokenized, referenced, and bound to event hypotheses within the LLM's reasoning trace. revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper's core contribution is a methodological reformulation of Video Temporal Grounding into an Identify-then-Measure process that populates an evidence pool of candidate segments via learned boundary-sensitive representations and then exposes them to the LLM for reasoning. This structure is presented as an explicit design choice motivated by observed brittleness in direct timestamp generation, with claims of stabilization and verifiability following directly from the exposure of discrete citable units rather than raw tokens. No equations, parameters, or uniqueness theorems are shown to reduce to fitted inputs or self-citations by construction; the empirical results on benchmarks and backbone transfer are reported as external validation. The derivation chain remains self-contained against the stated assumptions without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete and based on high-level claims; full details on any fitted parameters or background assumptions are absent.

axioms (1)
  • domain assumption: Decoupling identification from boundary measurement via an evidence pool will stabilize LLM-based grounding predictions.
    This premise underpins the entire Identify-then-Measure reformulation described in the abstract.

pith-pipeline@v0.9.0 · 5701 in / 1216 out tokens · 36171 ms · 2026-05-22T08:01:13.822946+00:00 · methodology

