FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Gueter Josmy Faure; Hung-Ting Su; Jia-Fong Yeh; Min-Hung Chen; Winston H. Hsu

arXiv: 2605.19846 · v1 · pith:QQ76WJIFnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.CL

Pith reviewed 2026-05-20 06:03 UTC · model grok-4.3

keywords: finebench · vlms · fine-grained · human-centric · understanding · fineagent · human · models

The pith

Open-source vision-language models underperform on fine-grained human activity understanding in videos, but FineAgent boosts their performance on the FineBench benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates FineBench, a benchmark with over 199,000 questions on 64 long videos to test detailed grasp of human movements, interactions, and object handling. Evaluations find that open-source VLMs perform poorly compared to closed ones, especially when dealing with multiple people or tiny variations in actions. The authors respond by building FineAgent, which adds modules to locate key areas and describe them in detail, leading to better results on the benchmark for several models. A sympathetic reader would care because accurate fine-grained video understanding is essential for applications like monitoring or human-robot interaction that require noticing small details in behavior.

Core claim

FineBench is introduced as a human-centric video VQA benchmark with 199,420 multiple-choice QA pairs across 64 long-form videos of about 15 minutes each, with dense annotations on person movement, interaction, and object manipulation, including compositional actions. The paper's evaluations show that while proprietary models achieve respectable performance, current open-source VLMs significantly underperform, with particular difficulties in spatial reasoning within multi-person scenes and in distinguishing subtle differences in human movements and interactions. To mitigate these issues, FineAgent is proposed as a modular framework that enhances VLMs through a Localizer and a Descriptor.
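To give a feel for what "dense" means here, the headline numbers can be turned into a rough annotation density. This is only back-of-the-envelope arithmetic: it assumes every video is exactly 15 minutes long, whereas the paper says "about 15 minutes", and the variable names are illustrative.

```python
# Rough annotation density from the figures quoted above.
# Assumption: each video lasts exactly 15 minutes (the paper says "about").
TOTAL_QA_PAIRS = 199_420
NUM_VIDEOS = 64
ASSUMED_VIDEO_SECONDS = 15 * 60  # 900 s per video, an approximation

qa_per_video = TOTAL_QA_PAIRS / NUM_VIDEOS
qa_per_second = qa_per_video / ASSUMED_VIDEO_SECONDS

print(f"QA pairs per video:  {qa_per_video:.1f}")   # 3115.9
print(f"QA pairs per second: {qa_per_second:.2f}")  # 3.46
```

Under that assumption the benchmark averages a little over three questions per second of video, which is consistent with the claim of frame-level grounding.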

What carries the argument

FineBench, the densely annotated long-form video VQA benchmark focused on fine-grained human activities with frame-level spatial and temporal grounding, and FineAgent, the modular framework that uses a Localizer to identify relevant video regions and a Descriptor to generate detailed descriptions for improved VLM reasoning.
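The Localizer, Descriptor, then VLM flow described above can be sketched roughly as follows. This is not the authors' implementation: the `Question` type, the callable signatures, the box format, and the prompt layout are all hypothetical, and the three models are replaced by trivial stubs so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in normalized coordinates.
Box = Tuple[float, float, float, float]

@dataclass
class Question:
    text: str
    options: Sequence[str]

# Hypothetical interfaces; a "frame" is any opaque object in this sketch.
Localizer = Callable[[object, str], List[Box]]   # frame, query -> boxes
Descriptor = Callable[[object, List[Box]], str]  # frame, boxes -> caption
VLM = Callable[[object, str], str]               # frame, prompt -> answer letter

def build_prompt(question: Question, boxes: List[Box], caption: str) -> str:
    """Combine localized regions, the caption, and the question into one prompt."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    option_lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(question.options)]
    box_lines = [f"- region at {b}" for b in boxes] or ["- (no region found)"]
    return "\n".join(
        ["Relevant regions:", *box_lines,
         "Description:", caption,
         "Question: " + question.text, *option_lines,
         "Answer with a single letter."]
    )

def fine_agent_answer(frame: object, question: Question,
                      localizer: Localizer, descriptor: Descriptor, vlm: VLM) -> str:
    """Run the three stages in order: localize, describe, then answer."""
    boxes = localizer(frame, question.text)
    caption = descriptor(frame, boxes)
    return vlm(frame, build_prompt(question, boxes, caption))

# Stub models, for illustration only.
q = Question("What is the person on the left holding?", ["A cup", "A phone"])
stub_localizer = lambda frame, query: [(0.1, 0.2, 0.4, 0.9)]
stub_descriptor = lambda frame, boxes: "A person on the left holds a cup."
stub_vlm = lambda frame, prompt: "A" if "holds a cup" in prompt else "B"

print(fine_agent_answer(None, q, stub_localizer, stub_descriptor, stub_vlm))
```

Keeping the three stages behind plain callables mirrors the modular framing: any of the stubs could be swapped for a real model without touching the others.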

Load-bearing premise

The benchmark's dense annotations and multiple-choice questions accurately measure genuine fine-grained understanding rather than rewarding superficial correlations or annotation artifacts.

What would settle it

A finding that FineAgent-enhanced models excel on FineBench but show no improvement when tested on independently annotated videos depicting similar fine-grained human activities would challenge whether the benchmark truly captures general understanding.

Figures

Figures reproduced from arXiv: 2605.19846 by Gueter Josmy Faure, Hung-Ting Su, Jia-Fong Yeh, Min-Hung Chen, Winston H. Hsu.

**Figure 1.** (a) Examples of question types in FineBench, which go beyond summarization to cover person posture, person-object interaction, and person-person interaction. (b) The temporal evolution of interaction labels across frames, emphasizing spatial granularity (e.g., distinguishing individuals in the same frame) and temporal granularity (e.g., resolving transitions between similar but distinct actions). …

**Figure 2.** Distribution of annotated persons per keyframe.

**Figure 3.** VLM performance analysis on FineBench detailing accuracy variations. (a) Performance degradation with increasing number …

**Figure 4.** Workflow of FineAgent. It begins with (1) prompt activation for the Localizer and Descriptor. (2) The Localizer and Descriptor, both foundation models, provide bounding box coordinates and textual captions. (3) Finally, the VLM uses this processed information during inference.

Original abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FineBench adds a dense new benchmark for fine-grained human activity VQA on long videos and pairs it with a modular FineAgent fix, but the evaluation omits a text-only baseline, and that omission undercuts the visual reasoning claims.

The paper makes two contributions: FineBench, a new benchmark with dense QA on long videos for fine-grained human activities, and FineAgent, a modular way to improve open VLMs on it. The scale and specificity are the real contributions here. The paper introduces 199,420 multiple-choice questions across 64 videos, each around 15 minutes long. The focus is on detailed aspects such as person movements, interactions, and object manipulations, including how actions compose. Prior human-centric benchmarks do not combine long-form content with dense, grounded annotations in this way.

Evaluations indicate that open-source VLMs perform worse than proprietary ones such as GPT-5, with particular difficulty in spatial reasoning when multiple people are present and in distinguishing close variations in actions. FineAgent adds a Localizer to find relevant regions and a Descriptor to describe them in detail, and experiments show that it lifts performance on FineBench for different open models.

One soft spot stands out. The stress-test note is right to flag the missing text-only baseline. Without testing models on the questions and answer choices alone, with no video input, it is difficult to confirm that the benchmark measures visual fine-grained understanding rather than language-based reasoning or information leaked from the annotations. If the paper does not include this or similar controls, its diagnosis of struggles with spatial and temporal reasoning becomes less convincing. Details on the annotation process and quality checks would also help, but the baseline issue is the more central one.

This work is for people developing or evaluating VLMs in computer vision and multimodal AI, particularly those interested in human activity understanding for downstream uses such as robotics or video analytics. Readers can get value from the benchmark construction and the practical enhancement approach. I recommend sending it for peer review. The new benchmark resource is worth the referees' time to assess and to suggest improvements on the evaluation side.

Referee Report

1 major / 0 minor

Summary. The paper introduces FineBench, a human-centric VQA benchmark with 199,420 densely annotated multiple-choice QA pairs across 64 long-form (15-minute) videos, targeting fine-grained aspects of person movement, interactions, and object manipulation. It reports that proprietary VLMs (e.g., GPT-5) achieve respectable results while open-source VLMs underperform, especially on spatial reasoning in multi-person scenes and subtle movement distinctions. To address these gaps, the authors propose FineAgent, a modular framework using a Localizer and Descriptor that yields consistent gains across several open VLMs on the benchmark.

Significance. If the benchmark questions genuinely require frame-level visual reasoning rather than linguistic priors, FineBench could serve as a useful large-scale testbed for fine-grained human activity understanding, an area relevant to applications such as robotics and surveillance. FineAgent provides a practical, modular enhancement strategy that avoids full model retraining. The work's value depends on verification that performance gaps reflect visual deficits.

major comments (1)

Evaluation section / abstract: The claim that open-source VLMs 'significantly underperform' and 'struggle particularly with spatial reasoning in multi-person scenes' (abstract) rests on the premise that the 199k QA pairs test visual fine-grained understanding. No text-only baseline (e.g., GPT-4 or Llama-3 answering from question text + options without video) is reported. This is load-bearing; if a language-only model exceeds chance substantially, the reported deficits and FineAgent gains (via Localizer+Descriptor) could reflect annotation artifacts or common-sense leakage instead of genuine visual limitations.
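The check the referee asks for can be made concrete with a small significance test: given a text-only model's score on multiple-choice items, how far above chance is it? The sketch below uses a normal approximation to the binomial. The number of answer options per question is not stated in the material above, so `num_options=4` and the counts in the example are purely illustrative assumptions, and the helper names are hypothetical.

```python
import math

def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    return 1.0 / num_options

def z_score_above_chance(correct: int, total: int, num_options: int) -> float:
    """Normal-approximation z-score of observed accuracy against chance.

    Assumes every question has the same number of options and that answers
    are independent; both are simplifications.
    """
    p0 = chance_accuracy(num_options)
    observed = correct / total
    std_err = math.sqrt(p0 * (1.0 - p0) / total)
    return (observed - p0) / std_err

def one_sided_p_value(z: float) -> float:
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Illustrative numbers only, not results from the paper:
# a text-only model answering 300 of 1000 four-option questions correctly.
z = z_score_above_chance(correct=300, total=1000, num_options=4)
print(f"z = {z:.2f}, one-sided p = {one_sided_p_value(z):.2g}")
```

A large z-score for a model that never sees the video would suggest that some questions are answerable from text alone, which is exactly the leakage the comment worries about.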

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The concern about verifying that FineBench truly evaluates visual fine-grained understanding, rather than linguistic priors, is well-taken and we address it directly below.

Referee: The claim that open-source VLMs 'significantly underperform' and 'struggle particularly with spatial reasoning in multi-person scenes' (abstract) rests on the premise that the 199k QA pairs test visual fine-grained understanding. No text-only baseline (e.g., GPT-4 or Llama-3 answering from question text + options without video) is reported. This is load-bearing; if a language-only model exceeds chance substantially, the reported deficits and FineAgent gains (via Localizer+Descriptor) could reflect annotation artifacts or common-sense leakage instead of genuine visual limitations.

Authors: We agree that explicitly demonstrating the visual nature of the benchmark is important. FineBench questions target fine-grained details such as precise hand-object interactions, subtle movement distinctions, and spatial configurations in multi-person scenes that are not reliably solvable from question text and common-sense reasoning alone. For example, many questions concern specific left/right distinctions or exact sequences of actions visible only in particular frames. Nevertheless, we acknowledge that including text-only baselines would provide stronger evidence against linguistic leakage. We will add these baselines (using Llama-3 and GPT-4 in text-only mode) to the evaluation section in the revised manuscript, expecting performance near chance on these fine-grained items. This addition will also clarify that FineAgent's gains stem from its visual localization and description modules rather than textual cues.

revision: yes

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain or self-referential reduction

The paper introduces FineBench as a new dataset of 199k QA pairs from 64 videos and evaluates existing VLMs plus a proposed FineAgent framework on it. All claims rest on fresh data collection, annotation, and model testing rather than any equation, fitted parameter, or prediction that reduces to the paper's own inputs. No self-citation is invoked as a load-bearing uniqueness theorem or ansatz; the work is self-contained against external benchmarks and does not rename known results or smuggle assumptions via prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and framework paper with no mathematical derivations. No free parameters, axioms, or invented physical entities are introduced; the Localizer and Descriptor are engineering modules within the proposed FineAgent system.

pith-pipeline@v0.9.0 · 5805 in / 1268 out tokens · 45430 ms · 2026-05-20T06:03:21.954254+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos... focusing on detailed person movement, person interaction, and object manipulation"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "FineAgent integrates two key components: a Localizer... and a Descriptor that generates frame summaries"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 3, 5, 6, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024. 2, 3

work page arXiv 2024

[3] [3]

Hv-mmbench: Benchmark- ing mllms for human-centric video understanding.arXiv preprint arXiv:2507.04909, 2025

Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Xinwei He, et al. Hv-mmbench: Benchmark- ing mllms for human-centric video understanding.arXiv preprint arXiv:2507.04909, 2025. 3

work page arXiv 2025

[4] [4]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 3, 6, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024. 5

work page 2024

[6] [6]

Hsu, and Shang-Hong Lai

Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung- Ting Su, Winston H. Hsu, and Shang-Hong Lai. Hermes: temporal-coherent long-form understanding with episodes and semantics, 2024. 3

work page 2024

[7] [7]

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su, Yung-Hao Tang, Shang-Hong Lai, and Winston H. Hsu. Moviecore: Cognitive reasoning in movies, 2025. 2

work page 2025

[8] [8]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever compre- hensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Ava: A video dataset of spatio-temporally localized atomic visual actions

Chunhui Gu, Chen Sun, David A Ross, Carl V ondrick, Car- oline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056,

work page

[10] [10]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2, 3

work page 2024

[12] [12]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 3

work page 2023

[13] [13]

Egoschema: A diagnostic benchmark for very long- form video language understanding.Advances in Neural In- formation Processing Systems, 36:46212–46244, 2023

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding.Advances in Neural In- formation Processing Systems, 36:46212–46244, 2023. 2, 3

work page 2023

[14] [14]

SmolVLM: Redefining small and efficient multimodal models

Andr ´es Marafioti, Orr Zohar, Miquel Farr ´e, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. Smolvlm: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Hello gpt-4o.https : / / openai

OpenAI. Hello gpt-4o.https : / / openai . com / index/hello-gpt-4o/, 2024. [Accessed 01-11-2024]. 6

work page 2024

[16] [16]

Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025

OpenAI. Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025. [Accessed 31- 08-2025]. 5, 6

work page 2025

[17] [17]

Humanibench: A human-centric framework for large multimodal models evaluation.arXiv preprint arXiv:2505.11454, 2025

Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ash- mal Vayani, Mukund S Chettiar, Amandeep Singh, Mubarak Shah, and Deval Pandya. Humanibench: A human-centric framework for large multimodal models evaluation.arXiv preprint arXiv:2505.11454, 2025. 2

work page arXiv 2025

[18] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2, 3

[19] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 5, 6

[20] Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2024. 2, 3

[21] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024. 3

[22] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021. 2, 3

[23] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016. 2, 3

[24] Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, et al. xgen-mm (blip-3): A family of open large multimodal models. arXiv preprint arXiv:2408.08872, 2024. 6

[25] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 3, 6, 8

[26] Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024. 3, 6, 8

[27] Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13040–13051, 2024. 6

[28] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019. 2, 3

[29] Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model, 2024. 7

[30] Ting Zhou, Daoyuan Chen, Qirui Jiao, Bolin Ding, Yaliang Li, and Ying Shen. Humanvbench: Exploring human-centric video understanding capabilities of mllms with synthetic benchmark data. arXiv preprint arXiv:2412.17574, 2024. 3