pith. sign in

arxiv: 2605.19075 · v1 · pith:EOEDF7EXnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

Pith reviewed 2026-05-20 10:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multimodal video question answeringkey-frame targetingclaim verificationcritic loopnews event analysisevidence attributionMAGMaR benchmarkautomatic speech recognition
0
0 comments X

The pith

CRAFT refines claims from video keyframes with a critic loop to improve accuracy in multi-video news question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRAFT as a pipeline that selects dynamic keyframes from multiple videos, transcribes speech with language fallback, and runs claims through a hybrid critic process to verify and fix them before final consolidation. This setup aims to surface query-relevant evidence while linking every fact to its exact video sources. A reader would care because it tackles the practical problem of grounding answers in real news video archives without unsupported statements. The method outperforms baselines on the MAGMaR 2026 benchmark and holds up on a converted WikiVideo set, with ablations pointing to atomic claims, audio transcripts, and the critic steps as the main drivers of improvement.

Core claim

CRAFT is a query-conditioned pipeline combining dynamic key-frame selection, per-video ASR with multilingual fallback, and a hybrid critic loop that uses UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator to iteratively verify and repair claims before emitting each fact once with all supporting source identifiers; on MAGMaR 2026 it records the highest overall average of 0.739, reference recall of 0.810, and citation F1 of 0.635, while also scoring 0.823 average on a MAGMaR-style WikiVideo conversion.

What carries the argument

The hybrid critic loop that iteratively verifies and repairs claims using temporal entailment checks, cross-claim screening, and an adjudicator model before citation merging.

If this is right

  • Atomic claims, automatic speech recognition, and the critic loop each contribute measurable gains over a vanilla query-conditioned baseline.
  • The same claim-centric aggregation approach generalizes to non-overlapping event queries on a converted WikiVideo dataset.
  • Final citation merging ensures each fact appears once with all supporting source identifiers attached.
  • The pipeline supports multilingual audio handling through fallback transcription.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Iterative claim repair may reduce unsupported statements in other multimodal evidence tasks such as timeline reconstruction from mixed video sources.
  • The emphasis on source attribution could extend to live news monitoring systems that need traceable facts across incoming video streams.
  • If the critic components scale with larger models, the method might support higher-volume archives without proportional increase in manual review.

Load-bearing premise

The hybrid critic loop can verify and repair claims without introducing new errors or biases.

What would settle it

Running the full pipeline on a fresh collection of news video queries and finding that the critic loop fails to raise reference recall or citation F1 above a simple query-conditioned baseline would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.19075 by Abdul Wasi, Akhil Gorugantu, David Doermann, Mahesh Bhosale, Pengyu Yan, Vishvesh Trivedi.

Figure 1
Figure 1. Figure 1: Overview of CRAFT. Given a persona, query, and relevant videos, CRAFT builds a query-specific multimodal evidence stream: each video is transcribed once via ASR, and Dynamic Keyframe Selection (DKS) selects the frames most relevant to the query. The base VLM consumes the persona, query, transcript, and sampled frames to produce atomic claims, which are refined by a hybrid critic loop—UNLI for temporal grou… view at source ↗
read the original abstract

Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at https://github.com/bhosalems/CRAFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CRAFT, a query-conditioned pipeline for grounded multi-video question answering over news events. It integrates dynamic key-frame selection, per-video ASR with multilingual fallback, and a hybrid critic loop (UNLI temporal entailment + DeBERTa-v3 cross-claim screening + Llama-3.2-3B adjudicator) to iteratively verify and repair claims before final citation merging. On the MAGMaR 2026 benchmark the system reports the highest overall average (0.739), reference recall (0.810), and citation F1 (0.635); a MAGMaR-style conversion of WikiVideo with 52 event queries yields 0.823 average. Ablations attribute the main gains to atomic claims, ASR, and the critic loop.

Significance. If the critic loop can be shown to improve claim quality without injecting new errors or biases, the pipeline would offer a practical advance in evidence-grounded multimodal QA with explicit source attribution. Public code release supports reproducibility. The current evaluation, however, leaves the reliability of the critic component itself unmeasured, weakening attribution of the headline metrics.

major comments (2)
  1. [Ablation studies] Ablation studies: gains are attributed to the critic loop, yet no separate precision/recall figures, error analysis on repaired versus original claims, or bias checks (e.g., Llama-3.2-3B source-type preference) are reported. End-to-end metrics alone cannot confirm that the observed 0.810 reference recall and 0.635 citation F1 arise from accurate repairs rather than other stages or silent failures.
  2. [Experiments on MAGMaR 2026] MAGMaR 2026 results: the central performance claims (0.739 avg, 0.810 recall, 0.635 F1) rest on the unverified assumption that the iterative critic loop (UNLI + DeBERTa-v3 + Llama adjudicator) never drops valid evidence or fabricates citations. Without direct measurement of the loop's own accuracy, the headline numbers remain difficult to interpret.
minor comments (1)
  1. [Abstract] The abstract and evaluation sections should state the exact number of videos and queries in the primary MAGMaR 2026 test set to allow readers to assess scale and statistical power.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify the need for more granular evaluation of the critic loop, and we address each point below while committing to revisions that strengthen the attribution of our results.

read point-by-point responses
  1. Referee: [Ablation studies] Ablation studies: gains are attributed to the critic loop, yet no separate precision/recall figures, error analysis on repaired versus original claims, or bias checks (e.g., Llama-3.2-3B source-type preference) are reported. End-to-end metrics alone cannot confirm that the observed 0.810 reference recall and 0.635 citation F1 arise from accurate repairs rather than other stages or silent failures.

    Authors: We thank the referee for this observation. Our ablations demonstrate that ablating the critic loop reduces performance on reference recall and citation F1, supporting its contribution alongside atomic claims and ASR. However, we agree that end-to-end metrics alone leave room for ambiguity about whether gains stem from accurate repairs. We will add dedicated precision/recall figures for the critic loop, an error analysis contrasting repaired versus original claims, and bias checks on the Llama-3.2-3B adjudicator (including source-type preferences) in the revised manuscript. revision: yes

  2. Referee: [Experiments on MAGMaR 2026] MAGMaR 2026 results: the central performance claims (0.739 avg, 0.810 recall, 0.635 F1) rest on the unverified assumption that the iterative critic loop (UNLI + DeBERTa-v3 + Llama adjudicator) never drops valid evidence or fabricates citations. Without direct measurement of the loop's own accuracy, the headline numbers remain difficult to interpret.

    Authors: We acknowledge the referee's concern that direct accuracy measurement of the critic loop would make the headline metrics easier to interpret. The hybrid design (UNLI temporal entailment, DeBERTa-v3 screening, and Llama-3.2-3B adjudication) is intended to minimize both dropped evidence and fabricated citations, and the consistent gains across MAGMaR 2026 and the WikiVideo conversion support its net positive effect. To strengthen this, we will include a direct accuracy evaluation of the critic loop on a held-out sample of claims in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity in CRAFT's empirical pipeline and benchmark results

full rationale

The paper describes a query-conditioned pipeline for multimodal video QA evaluated on external datasets (MAGMaR 2026 and a MAGMaR-style WikiVideo conversion). Performance numbers (0.739 avg, 0.810 recall, 0.635 F1) are reported as direct experimental outcomes rather than predictions derived from fitted parameters or equations within the paper. Ablations attribute gains to components like the critic loop, but these are empirical comparisons on held-out data, not self-definitional reductions or renamings of inputs. No mathematical derivation chain, uniqueness theorems, or ansatzes are invoked that collapse back to the paper's own fitted values or self-citations. The system is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical engineering pipeline with no mathematical derivations or explicit free parameters visible in the abstract; it relies on off-the-shelf models (UNLI, DeBERTa-v3, Llama-3.2-3B) whose internal parameters are inherited from prior training.

pith-pipeline@v0.9.0 · 5786 in / 1204 out tokens · 44541 ms · 2026-05-20T10:41:34.673704+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 9 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    Proceedings of the 2024 ACM conference on fairness, accountability, and transparency , pages=

    Careless whisper: Speech-to-text hallucination harms , author=. Proceedings of the 2024 ACM conference on fairness, accountability, and transparency , pages=

  8. [8]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  9. [9]

    Qwen2.5-

    Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Z...

  10. [10]

    Bai, Shuai and Cai, Yuxuan and Chen, Ruizhe and Chen, Keqin and Chen, Xionghui and Cheng, Zesen and Deng, Lianghao and Ding, Wei and Gao, Chang and Ge, Chunjiang and Ge, Wenbin and Guo, Zhifang and Huang, Qidong and Huang, Jie and Huang, Fei and Hui, Binyuan and Jiang, Shutong and Li, Zhaohai and Li, Mingsheng and Li, Mei and Li, Kaixin and Lin, Zicheng a...

  11. [11]

    Zhu, Jinguo and Wang, Weiyun and Chen, Zhe and Liu, Zhaoyang and Ye, Shenglong and Gu, Lixin and Tian, Hao and Duan, Yuchen and Su, Weijie and Shao, Jie and Gao, Zhangwei and Cui, Erfei and Cao, Yue and Liu, Yangzhou and Wei, Xingguang and Zhang, Hongjie and Wang, Haomin and Xu, Weiye and Li, Hao and Wang, Jiahao and others , journal =

  12. [12]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya , title =. 2022 , copyright =. doi:10.48550/ARXIV.2212.04356 , url =

  13. [13]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Video Instruction Tuning with Synthetic Data , author =. arXiv preprint arXiv:2410.02713 , year =

  14. [14]

    Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan , journal =

  15. [15]

    arXiv preprint arXiv:2504.00939 , year=

    Wikivideo: Article generation from multiple videos , author=. arXiv preprint arXiv:2504.00939 , year=

  16. [16]

    and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas , journal =

    Shen, Xiaoqian and Xiong, Yunyang and Zhao, Changsheng and Wu, Lemeng and Chen, Jun and Zhu, Chenchen and Liu, Zechun and Xiao, Fanyi and Varadarajan, Balakrishnan and Bordes, Florian and Liu, Zhuang and Xu, Hu and Kim, Hyunwoo J. and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas , journal =

  17. [17]

    Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

    Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

  18. [18]

    Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Liang, Zhengyang and Huang, Tiejun and Zhao, Bo , journal =

  19. [19]

    2024 , pages =

    Song, Enxin and Chai, Wenhao and Wang, Guanhong and Zhang, Yucheng and Zhou, Haoyang and Wu, Feiyang and Chi, Haozhe and Guo, Xun and Ye, Tian and Zhang, Yanting and Lu, Yan and Hwang, Jenq-Neng and Wang, Gaoang , booktitle =. 2024 , pages =

  20. [20]

    He, Bo and Li, Hengduo and Jang, Young Kyun and Jia, Menglin and Cao, Xuefei and Shah, Ashish and Shrivastava, Abhinav and Lim, Ser-Nam , booktitle =

  21. [21]

    Qwen3-ASR Technical Report

    Qwen3-ASR Technical Report , author=. arXiv preprint arXiv:2601.21337 , year=

  22. [22]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

    Adaptive Keyframe Sampling for Long Video Understanding , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

  23. [23]

    2025 , note =

    Gao, Hong and Wang, Yiming and Hu, Xin and Cao, Xun and Tao, Mingkui , journal =. 2025 , note =

  24. [24]

    Uncertain Natural Language Inference

    Chen, Tongfei and Jiang, Zhengping and Poliak, Adam and Sakaguchi, Keisuke and Van Durme, Benjamin. Uncertain Natural Language Inference. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.774

  25. [25]

    2025 , pages =

    Wang, Ziyang and Yu, Shoubin and Stengel-Eskin, Elias and Yoon, Jaehong and Cheng, Feng and Bertasius, Gedas and Bansal, Mohit , booktitle =. 2025 , pages =

  26. [26]

    2025 , note =

    Sun, Hui and Lu, Shiyin and Wang, Huanyu and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Li, Ming , journal =. 2025 , note =

  27. [27]

    2025 , note =

    Zhang, Shaojie and Yang, Jiahui and Yin, Jianqin and Luo, Zhenbo and Luan, Jian , journal =. 2025 , note =

  28. [28]

    Pengcheng He and Jianfeng Gao and Weizhu Chen , year=. De. International Conference on Learning Representations (

  29. [29]

    Zhang, Xian and Wu, Zexi and Li, Zinuo and Xu, Hongming and Gong, Luqi and Boussaid, Farid and Werghi, Naoufel and Bennamoun, Mohammed , journal =

  30. [30]

    International Conference on Learning Representations (ICLR) , year=

    BERTScore: Evaluating Text Generation with BERT , author=. International Conference on Learning Representations (ICLR) , year=

  31. [31]

    Text Summarization Branches Out: Proceedings of the ACL-04 Workshop , pages=

    ROUGE: A Package for Automatic Evaluation of Summaries , author=. Text Summarization Branches Out: Proceedings of the ACL-04 Workshop , pages=

  32. [32]

    Ragas: Automated Evaluation of Retrieval Augmented Generation

    RAGAs: Automated Evaluation of Retrieval Augmented Generation , author=. arXiv preprint arXiv:2309.15217 , year=

  33. [33]

    Unified Multimodal Uncertain Inference

    Unified Multimodal Uncertain Inference , author=. arXiv preprint arXiv:2604.08701 , year=

  34. [34]

    arXiv preprint arXiv:2510.24870 , year=

    Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation , author=. arXiv preprint arXiv:2510.24870 , year=

  35. [35]

    A Simple

    Zhang, Ce and Lu, Taixi and Islam, Md Mohaiminul and Wang, Ziyang and Yu, Shoubin and Bansal, Mohit and Bertasius, Gedas , booktitle =. A Simple. 2024 , pages =

  36. [36]

    Wang, Xiaohan and Zhang, Yuhui and Zohar, Orr and Yeung-Levy, Serena , booktitle =

  37. [37]

    2024 , pages =

    Min, Juhong and Buch, Shyamal and Nagrani, Arsha and Cho, Minsu and Schmid, Cordelia , booktitle =. 2024 , pages =

  38. [38]

    Zhi, Zhuo and Wu, Qiangqiang and Shen, Minghe and Li, Wenbo and Li, Yinchuan and Shao, Kun and Zhou, Kaiwen , journal =

  39. [39]

    Deep video discovery: Agentic search with tool use for long-form video understanding.arXiv preprint arXiv:2505.18079, 2025

    Deep Video Discovery: Agentic Search with Tool Use for Long-Form Video Understanding , author =. arXiv preprint arXiv:2505.18079 , year =

  40. [40]

    Yuan, Huaying and Liu, Zheng and Zhou, Junjie and Qian, Hongjin and Wen, Ji-Rong and Dou, Zhicheng , journal =

  41. [42]

    2024 , url=

    Llama 3.2: 1B and 3B Instruct Model Card , author=. 2024 , url=

  42. [43]

    Asai, Akari and Wu, Zeqiu and Wang, Yizhong and Sil, Avirup and Hajishirzi, Hannaneh , booktitle =. Self-

  43. [44]

    Corrective Retrieval Augmented Generation

    Corrective Retrieval Augmented Generation , author =. arXiv preprint arXiv:2401.15884 , year =

  44. [45]

    Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng , booktitle =

  45. [46]

    Dang, Jisheng and Song, Huilin and Xiao, Junbin and Wang, Bimei and Peng, Han and Li, Haoxuan and Yang, Xun and Wang, Meng and Chua, Tat-Seng , journal =

  46. [47]

    Qwen3-Omni Technical Report

    Qwen3-Omni Technical Report , author=. arXiv preprint arXiv:2509.17765 , year=

  47. [48]

    Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) , year =

    Correctness is not Faithfulness in Retrieval Augmented Generation Attributions , author =. Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) , year =

  48. [49]

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding , author=. arXiv preprint arXiv:2601.10611 , year=

  49. [50]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma: Open models based on gemini research and technology , author=. arXiv preprint arXiv:2403.08295 , year=

  50. [51]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  51. [52]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , year =

    MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents , author =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , year =

  52. [53]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  53. [54]

    Wang, Yuxuan and Wang, Yueqian and Zhao, Dongyan and Xie, Cihang and Zheng, Zilong , journal =

  54. [55]

    Zhang, Jiacheng and Jiao, Yang and Chen, Shaoxiang and Zhao, Na and Tan, Zhiyu and Li, Hao and Ma, Xingjun and Chen, Jingjing , journal =

  55. [56]

    2025 , pages =

    Li, Chaoyu and Im, Eun Woo and Fazli, Pooyan , booktitle =. 2025 , pages =

  56. [57]

    Kriz, Reno and Sanders, Kate and Etter, David and Murray, Kenton and Carpenter, Cameron and Van Ochten, Kelly and Recknor, Hannah and Guallar-Blasco, Jimena and Martin, Alexander and Colaianni, Ronald and King, Nolan and Yang, Eugene and Van Durme, Benjamin , booktitle =

  57. [58]

    Samuel, Saron and DeGenaro, Dan and Guallar-Blasco, Jimena and Sanders, Kate and Eisape, Oluwaseun and Spendlove, Tanner and Reddy, Arun and Martin, Alexander and Yates, Andrew and Yang, Eugene and Carpenter, Cameron and Etter, David and Kayi, Efsun and Wiesner, Matthew and Murray, Kenton and Kriz, Reno , booktitle =

  58. [59]

    2025 , pages =

    Jeong, Soyeong and Kim, Kangsan and Baek, Jinheon and Hwang, Sung Ju , booktitle =. 2025 , pages =

  59. [60]

    Ren, Xubin and Xu, Lingrui and Xia, Long and Wang, Shuaiqiang and Yin, Dawei and Huang, Chao , journal =

  60. [61]

    Zeng, Nianbo and Hou, Haowen and Yu, Fei Richard and Shi, Si and He, Ying Tiffany , journal =

  61. [62]

    arXiv preprint arXiv:2510.02262 , year =

    From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding , author =. arXiv preprint arXiv:2510.02262 , year =

  62. [63]

    arXiv preprint arXiv:2407.15047 , year =

    End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling , author =. arXiv preprint arXiv:2407.15047 , year =

  63. [64]

    Zou, Yuanhao and Liu, Yifan and Liu, Yang and Zhang, Yifan and Zhang, Han and Chen, Chen , journal =

  64. [65]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

    Re-Thinking Temporal Search for Long-Form Video Understanding , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =