HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

HaoPeng Zhang; Mengqi Shi

arxiv: 2605.19223 · v1 · pith:A6HWBVDMnew · submitted 2026-05-19 · 💻 cs.CV


Pith reviewed 2026-05-20 07:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords hierarchical alignment, multimodal benchmark, video understanding, cross-modal alignment, temporal reasoning, video summarization, MLLM evaluation


The pith

HAVEN creates a benchmark with continuous video-text alignment at frame, shot, and video levels to test unified multimodal understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing video benchmarks split evaluation into disconnected pieces such as single keyframes or separate text summaries, which breaks the connected way stories unfold across levels. The paper argues this fragmentation hides whether models can maintain alignment from small details up to overall meaning. HAVEN builds a dataset with explicit alignments linking frames to shots to full videos, each paired with matching text descriptions. The evaluation then checks models on summarization, temporal reasoning, grounding, and saliency using this connected structure. If the approach holds, it shows many current models produce fluent text while missing the grounded connections needed for complex narratives.

Core claim

HAVEN pioneers a fully granular and fully multimodal dataset architecture, complete with explicit, continuous alignment between modalities at frame, shot, and video levels. Built upon this unified annotation paradigm, the work proposes a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding.

What carries the argument

The hierarchically aligned multimodal dataset architecture that supplies explicit continuous alignment between video and text across frame, shot, and video levels.

If this is right

Models can now be checked for consistency when moving from frame details to shot sequences to full-video meaning.
Evaluation moves past single-answer formats to include continuous cross-level alignment checks.
The benchmark supplies a standardized way to measure progress toward interpretable hierarchical video understanding.
Future model development can target the exposed gap between fluent text output and actual multimodal grounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Model training could incorporate similar hierarchical alignment signals to close the observed performance gap.
The same multi-granularity structure might transfer to other sequential data such as audio streams or long documents.
Real-world video tools for search or editing might gain reliability by adopting evaluation that tracks alignment across levels.

Load-bearing premise

That existing benchmarks fragment supervision across isolated granularities and therefore cannot capture the hierarchical structure of cross-modal alignment.

What would settle it

If top MLLMs score as highly on the new hierarchical tasks as on standard isolated benchmarks and show no measurable gap in grounded understanding, the claim that the unified paradigm is required would be challenged.

Figures

Figures reproduced from arXiv: 2605.19223 by HaoPeng Zhang, Mengqi Shi.

**Figure 1.** Comparison between different MLLMs on HAVEN across different capabilities. We include text-proxy LLMs as baselines. Video Understanding Benchmarks. Recent benchmarks have advanced the evaluation of MLLMs on video understanding, covering tasks such as event understanding, temporal reasoning, question answering, and multi-shot comprehension [Li et al., 2024, Fu et al., 2025, Liu et al., 2024b, Han et al., …

**Figure 2.** Example of a data instance in HAVEN. We construct a hierarchically structured, multimodal …

**Figure 3.** Overview of tasks supported by HAVEN. … units and visual segments. These representations serve both as inputs for certain tasks and as reference text for evaluation. More details on the dataset construction pipeline and annotation procedures are provided in Appendix A. 3.3 Annotation Quality Assessment. We evaluate annotation quality via human evaluation on 15 sampled videos across all annotation levels. …

**Figure 4.** Comparison across temporal understanding, multimodal grounding, and saliency ranking.

**Figure 5.** Summarization performance relationships across tasks. (a) Text summarization quality (BERTScore) vs. keyframe selection performance (F1). (b) Joint summarization quality. We observe that model performance is highly sensitive to task formulation, suggesting that multimodal capabilities cannot be reliably assessed using a single evaluation setup. For example, while some models achieve strong performance on…

**Figure 6.** Temporal understanding model gains from multimodal input. Across multiple tasks, we observe a consistent performance trend in Figs. 4a and 4c: multimodal inputs generally outperform visual-only inputs, which in turn outperform text-only inputs (represented by the text-proxy baseline). This indicates that models are able to benefit from the complementary information provided by multiple mo…
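The Figure 5 caption reports keyframe selection performance as F1. As a minimal sketch of how such a score can be computed, the snippet below treats predicted and reference keyframes as sets of frame indices and counts only exact index matches. The function name `keyframe_f1` and the exact-match rule are assumptions made for illustration; this page does not state the paper's matching protocol, which may allow a temporal tolerance.

```python
from typing import Iterable


def keyframe_f1(predicted: Iterable[int], reference: Iterable[int]) -> float:
    """F1 between predicted and reference keyframe indices.

    Assumption (not from the paper): a prediction counts as correct only
    if its frame index appears exactly in the reference set.
    """
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # nothing to select and nothing selected
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical indices: 2 of 4 predictions match 2 of 3 references.
score = keyframe_f1([3, 10, 25, 40], [3, 25, 31])
print(round(score, 4))  # 0.5714
```

Exact matching is the strictest choice; a tolerance window around each reference frame would raise scores for near misses and would need to be reported alongside the metric.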

Original abstract

While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HAVEN introduces a hierarchical benchmark with frame-shot-video alignments for video-text, but the abstract leaves the continuous alignment mechanism and its validation unspecified.


The main point is that HAVEN tries to fix fragmented video benchmarks by creating one dataset with explicit alignments at frame, shot, and video levels for both modalities. The authors argue this unified structure better matches how narratives work and lets them test MLLMs on summarization, temporal reasoning, grounding, and saliency in one place. They also release the data and protocols, which is straightforward and useful if the alignments actually deliver what is claimed.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. It argues that existing summarization benchmarks fragment supervision across isolated granularities (keyframes, key shots, disjointed text summaries) and fail to capture the hierarchical structure of cross-modal alignment. HAVEN provides annotations at frame, shot, and video levels with explicit, continuous alignment between video and text modalities, together with an evaluation suite covering summarization, temporal reasoning, multimodal grounding, and saliency ranking. Benchmarking of state-of-the-art MLLMs is said to expose a persistent gap between surface-level textual fluency and grounded multimodal understanding. The dataset, benchmark suite, and evaluation protocols are to be released publicly.

Significance. If the claimed continuous cross-modal alignment is technically realized and validated, HAVEN could supply a more rigorous, standardized testbed that better reflects the hierarchical nature of video narratives than prior fragmented benchmarks. The public release of the full dataset and protocols would be a concrete contribution to the field.

major comments (1)

[Abstract] The central claim that HAVEN supplies 'explicit, continuous alignment between modalities' at frame/shot/video granularities is not accompanied by any definition, algorithm, consistency check, or quantitative alignment score. It is therefore impossible to determine whether the architecture implements propagated semantic links or shared temporal anchors rather than independent per-level annotations.

minor comments (1)

[Abstract] The phrase 'extensive benchmarking of state-of-the-art MLLMs' is stated without any numerical results, model names, or key performance deltas; adding one or two concrete findings would improve the summary paragraph.
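The kind of consistency check the major comment asks for can be illustrated with a small sketch. It assumes, hypothetically, that frame annotations carry a timestamp and shot annotations carry a half-open time span; the names `Frame`, `Shot`, `anchor_coverage`, and `shots_tile_video` are invented for this example and are not taken from the paper. If levels share temporal anchors, every frame should fall in exactly one shot and the shots should tile the video without gaps or overlaps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    t: float      # timestamp in seconds (assumed representation)
    text: str


@dataclass
class Shot:
    start: float  # inclusive
    end: float    # exclusive
    text: str


def anchor_coverage(frames: List[Frame], shots: List[Shot]) -> float:
    """Fraction of frames whose timestamp lies in exactly one shot span."""
    if not frames:
        return 0.0
    anchored = 0
    for frame in frames:
        containing = sum(1 for s in shots if s.start <= frame.t < s.end)
        if containing == 1:
            anchored += 1
    return anchored / len(frames)


def shots_tile_video(shots: List[Shot], duration: float, tol: float = 1e-6) -> bool:
    """True if the shots cover [0, duration) with no gaps or overlaps."""
    ordered = sorted(shots, key=lambda s: s.start)
    if not ordered:
        return False
    if abs(ordered[0].start) > tol or abs(ordered[-1].end - duration) > tol:
        return False
    return all(abs(a.end - b.start) <= tol for a, b in zip(ordered, ordered[1:]))


frames = [Frame(0.5, "a"), Frame(1.5, "b"), Frame(2.5, "c"), Frame(5.0, "d")]
shots = [Shot(0.0, 2.0, "first shot"), Shot(2.0, 4.0, "second shot")]
print(anchor_coverage(frames, shots))   # 0.75 (the frame at 5.0 s is unanchored)
print(shots_tile_video(shots, 4.0))     # True
```

A coverage below 1.0 or a failed tiling check would point to independently produced per-level annotations rather than a shared temporal skeleton.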

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for identifying an area where the abstract could better convey the technical details of our alignment approach. We address the comment below and will incorporate clarifications in the revised manuscript.


Referee: [Abstract] The central claim that HAVEN supplies 'explicit, continuous alignment between modalities' at frame/shot/video granularities is not accompanied by any definition, algorithm, consistency check, or quantitative alignment score. It is therefore impossible to determine whether the architecture implements propagated semantic links or shared temporal anchors rather than independent per-level annotations.

Authors: We appreciate the referee pointing out that the abstract does not sufficiently define or substantiate the alignment claim. In the full manuscript, Section 3.2 details the hierarchical annotation protocol: frame-level descriptions are generated first with precise temporal boundaries, then aggregated bottom-up into shot-level summaries via shared timestamps and semantic consistency, and finally into video-level narratives. This creates propagated semantic links rather than independent annotations. Consistency is enforced through a multi-stage review process with reported inter-annotator agreement scores (Section 4.1). We acknowledge the abstract is too concise on this point. We will revise the abstract to include a brief definition of the continuous alignment mechanism (via shared temporal anchors and semantic propagation) and add a reference to the methods section. We will also highlight an existing quantitative validation metric (semantic similarity across levels) more prominently in the revised text.

Revision: yes
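The bottom-up protocol described in the simulated rebuttal (frames grouped into shots by shared timestamps, then checked for semantic consistency across levels) can be sketched as follows. Everything here is illustrative: the tuple layouts, the function names, and especially the use of token-level Jaccard overlap as a crude lexical stand-in for the "semantic similarity across levels" metric are assumptions, not the authors' method.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

FrameAnn = Tuple[float, str]         # (timestamp in seconds, description)
ShotAnn = Tuple[float, float, str]   # (start, end, summary), span is [start, end)


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def group_frames_by_shot(frames: List[FrameAnn], shots: List[ShotAnn]) -> Dict[int, List[str]]:
    """Attach each frame description to the first shot whose span contains it."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for t, text in frames:
        for i, (start, end, _) in enumerate(shots):
            if start <= t < end:
                grouped[i].append(text)
                break
    return grouped


def cross_level_overlap(frames: List[FrameAnn], shots: List[ShotAnn]) -> List[float]:
    """Per-shot Jaccard overlap between the shot summary and its frames' text.

    Lexical overlap is only a placeholder for an embedding-based similarity.
    """
    grouped = group_frames_by_shot(frames, shots)
    scores = []
    for i, (_, _, summary) in enumerate(shots):
        child = set().union(*(tokens(x) for x in grouped.get(i, [])))
        parent = tokens(summary)
        union = parent | child
        scores.append(len(parent & child) / len(union) if union else 0.0)
    return scores


frames = [(0.5, "a dog runs"), (1.5, "the dog jumps"), (2.5, "a cat sleeps")]
shots = [(0.0, 2.0, "a dog runs and jumps"), (2.0, 4.0, "birds fly")]
print([round(s, 3) for s in cross_level_overlap(frames, shots)])  # [0.667, 0.0]
```

In this toy case the first shot summary agrees with its frames while the second does not, which is the kind of cross-level mismatch a validation metric would need to surface. The same pattern extends one level up by comparing shot summaries against the video-level narrative.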

Circularity Check

0 steps flagged

No circularity: benchmark dataset paper with no derivations or fitted predictions


The paper introduces HAVEN as a new hierarchically aligned multimodal benchmark without any mathematical derivations, equations, predictions, or parameter fitting. The central description of 'explicit, continuous alignment' is a definitional claim about the dataset architecture itself rather than a result derived from prior quantities or self-citations. No load-bearing steps reduce to inputs by construction, and the evaluation suite is proposed independently based on the new paradigm. This is a standard self-contained dataset contribution with no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a benchmark creation paper; the central addition is the new dataset and protocols rather than derivations from prior results. No free parameters or invented entities are introduced beyond the benchmark definition itself.

axioms (1)

Domain assumption: Existing summarization benchmarks fragment supervision across isolated granularities such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment.
This premise is stated directly in the abstract as the motivation for HAVEN.

pith-pipeline@v0.9.0 · 5727 in / 1235 out tokens · 50697 ms · 2026-05-20T07:33:57.891795+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation between the paper passage and the cited Recognition theorem: unclear

HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 6 internal anchors

[1] Sangwoo Cho, Kaiqiang Song, Xiaoyang Wang, Fei Liu, and Dong Yu. Toward unifying text segmentation and long document summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 106–118, 2022.
[2] Ori Ernst, Ori Shapira, Aviv Slobodkin, Sharon Adar, Mohit Bansal, Jacob Goldberger, Ran Levy, and Ido Dagan. The power of summary-source alignments. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6527–6548, 2024.
[3] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc Van Gool. Creating Summaries from User Videos. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 505–520, 2014. doi: 10.1007/978-3-319-10584-0_33.
[4] Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao, and Heng Wang. Shot2Story: A new benchmark for comprehensive understanding of multi-shot videos. arXiv preprint arXiv:2312.10300.
[5] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
[6] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023.
[7] Zizhong Li, Haopeng Zhang, and Jiawei Zhang. MMViR: A multi-modal and multi-granularity representation for long-range video understanding. arXiv preprint arXiv:2601.05495.
[8] doi: 10.1109/TMM.2023.3335875. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023.
[9] Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, and Saab Mansour. MDSEval: A meta-evaluation benchmark for multimodal dialogue summarization. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 14707–14727, 2025.
[10] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024a.
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu H...
[11] OpenAI GPT-5 System Card. doi: 10.48550/arXiv.2601.03267.
Mayu Otani, Yuta Nakashima, Esa Rahtu, and Janne Heikkila. Rethinking the Evaluation of Video Summaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7596–7604.
[12] Qwen Team. Qwen2.5-VL Technical Report. arXiv e-prints, 2025a. doi: 10.48550/arXiv.2502.13923.
Qwen Team. Qwen3-VL Technical Report, 2025b.
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF confer...
[13] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024.
[14] David Wan and Mohit Bansal. Evaluating and improving factuality in multimodal abstractive summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9632–9648, 2022.
[15] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942.
[16] doi: 10.1109/CVPR.2015.7299154. Haohan Yuan and Haopeng Zhang. Understanding LLM reasoning for abstractive summarization. arXiv preprint arXiv:2512.03503.
[17] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106.
[18] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
[19] GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. URL https://arxiv.org/abs/2507.01006.
Junnan Zhu, Haoran Li, Tianshang Liu, Yu Zhou, Jiajun Zhang, and Chengqing Zong. MSMO: Multimodal Summarization with Multimodal Output. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4154–4164, 2018. doi: 10.18653/v1/D18-1448.
[20] You are given N frames from a video shown in chronological order
Dataset Construction Pipeline. Input Sources and Preprocessing: Raw videos, benchmark metadata, and dataset-specific saliency annotations are unified into a common processing interface. Multi-Level Annotation Construction: Frame-, shot-, and video-level text annotations are built progressively from shared structural signals. Rel...
