pith. machine review for the scientific record.

arxiv: 2604.25186 · v2 · submitted 2026-04-28 · 💻 cs.CV · cs.CE · cs.MM

Recognition: unknown

FCMBench-Video: Benchmarking Document Video Intelligence

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 17:08 UTC · model grok-4.3

classification 💻 cs.CV · cs.CE · cs.MM
keywords document video understanding · Video-MLLM evaluation · evidence-grounded reasoning · temporal grounding · cross-document validation · financial document analysis · authenticity verification · benchmark construction

The pith

FCMBench-Video supplies a benchmark that meaningfully separates Video-MLLMs on temporal grounding and evidence integration for document videos captured under realistic conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FCMBench-Video as a dataset and evaluation suite for document-video intelligence, where models must integrate evidence across frames while respecting acquisition cues that matter for financial verification tasks. Construction proceeds by recording reusable single-document clips, introducing controlled degradations, and composing them into longer multi-document videos paired with expert-annotated questions. Tests across nine recent Video-MLLMs demonstrate clear capability distinctions: counting accuracy declines with video length, cross-document validation and evidence-grounded selection expose integration limits, and visual prompt injection adds a robustness axis. The resulting score spread is wide and roughly bell-shaped, showing the benchmark is neither saturated nor trivial. This matters for applications that require both decision accuracy and traceable evidence in authenticity-sensitive settings.

Core claim

FCMBench-Video is assembled from 495 atomic videos into 1,200 long-form videos that contain 11,322 expert-annotated question-answer pairs spanning 28 document types, 20- to 60-second duration tiers, and balanced Chinese and English instances. When nine Video-MLLMs are evaluated, counting proves most sensitive to duration, Cross-Document Validation and Evidence-Grounded Selection surface higher-order evidence integration, Visual Prompt Injection supplies an orthogonal robustness measure, and the aggregate scores form a broad, approximately normal distribution.
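
A minimal sketch of how the "broad, approximately bell-shaped" claim could be checked once per-model overall scores are extracted; the nine values below are hypothetical placeholders, not the paper's numbers, and with n = 9 the Shapiro-Wilk test has little power either way.

```python
import numpy as np
from scipy import stats

# Hypothetical overall scores for the nine evaluated Video-MLLMs
# (placeholders, not values reported in the paper).
scores = np.array([31.2, 38.5, 42.0, 45.7, 48.3, 51.1, 54.9, 60.2, 68.4])

spread = scores.max() - scores.min()     # "broad": wide range, no ceiling effect
skewness = stats.skew(scores)            # near 0 for a symmetric distribution
w_stat, p_value = stats.shapiro(scores)  # weak normality check at n = 9

print(f"range={spread:.1f}, skew={skewness:.2f}, shapiro p={p_value:.3f}")
```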

What carries the argument

The atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with expert annotations.
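
The workflow lends itself to a compact description; a minimal sketch, where the helpers and the degradation list (`apply_degradation`, `DEGRADATIONS`) are hypothetical stand-ins, not the authors' released tooling:

```python
import random
from dataclasses import dataclass

@dataclass
class Clip:
    doc_type: str   # one of the 28 document types
    frames: list    # decoded frames of one single-document recording
    language: str   # "zh-CN" or "en-US"

# Hypothetical degradation profiles; the paper applies "controlled
# degradations" without enumerating them in this summary.
DEGRADATIONS = ["motion_blur", "low_light", "glare", "compression"]

def apply_degradation(clip: Clip, kind: str) -> Clip:
    """Placeholder: would perturb every frame with the named artifact."""
    return Clip(clip.doc_type, [f"{kind}({f})" for f in clip.frames], clip.language)

def compose(atomic_clips: list[Clip], target_seconds: int, fps: int = 30) -> list:
    """Chain degraded single-document clips into one long-form video
    until the prescribed duration tier (20-60 s) is filled."""
    video, budget = [], target_seconds * fps
    while len(video) < budget:
        clip = random.choice(atomic_clips)  # assumes clips have frames
        video.extend(apply_degradation(clip, random.choice(DEGRADATIONS)).frames)
    return video[:budget]
```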

If this is right

  • Counting accuracy degrades measurably as video duration increases from 20 s to 60 s (see the sketch after this list).
  • Cross-Document Validation and Evidence-Grounded Selection expose limits in multi-document evidence synthesis that single-frame models cannot address.
  • Visual Prompt Injection reveals a distinct failure mode orthogonal to temporal grounding.
  • The broad, bell-shaped score distribution allows future model releases to be ranked without hitting a performance ceiling.
  • The benchmark supplies a reproducible yardstick for tracking progress on document-video tasks in credit-domain applications.
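
The first bullet is directly measurable from per-question results; a minimal sketch, assuming a flat result-record schema (`model`, `task`, `tier`, `correct`) that the actual release format may not match:

```python
import pandas as pd

# Assumed per-question records; the actual release schema may differ.
results = pd.DataFrame([
    {"model": "m1", "task": "counting", "tier": 20, "correct": 1},
    {"model": "m1", "task": "counting", "tier": 60, "correct": 0},
    # ... one row per (model, question) pair
])

counting = results[results["task"] == "counting"]
print(counting.groupby("tier")["correct"].mean())
# The prediction: mean accuracy falls from the 20 s tier to the 60 s tier.
```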

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models that improve on FCMBench-Video may also reduce error rates in live financial onboarding pipelines that already rely on short video clips.
  • Adding explicit anti-fraud cues such as lighting consistency or device metadata to the annotation protocol could tighten the link between benchmark scores and real fraud detection.
  • Combining FCMBench-Video with existing static-document benchmarks would quantify the incremental value of temporal evidence streams.
  • Repeating the construction workflow with different degradation profiles could test whether current models are brittle to specific capture artifacts common in mobile submissions.

Load-bearing premise

The atomic-acquisition and composition workflow with controlled degradations and expert annotations produces videos and questions that faithfully represent real-world, authenticity-sensitive document video capture conditions.

What would settle it

A direct side-by-side comparison in which the same models are run on FCMBench-Video and on a matched set of field-collected credit-review videos; divergence in the ordering of model strengths or in the relative difficulty of counting versus integration tasks would indicate the benchmark does not capture authentic conditions.
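
One way to operationalize that comparison is a rank correlation over the shared model set; a minimal sketch with hypothetical scores, where a Spearman coefficient well below 1 would be the divergence signal described above:

```python
from scipy.stats import spearmanr

# Hypothetical overall scores for the same models on both evaluations.
bench = {"m1": 48.3, "m2": 61.0, "m3": 39.5, "m4": 55.2}  # FCMBench-Video
field = {"m1": 44.1, "m2": 58.7, "m3": 35.0, "m4": 57.9}  # field-collected videos

models = sorted(bench)
rho, p = spearmanr([bench[m] for m in models], [field[m] for m in models])
# rho near 1: the benchmark preserves the field ordering;
# rho well below 1: the benchmark fails to capture authentic conditions.
print(f"spearman rho={rho:.2f} (p={p:.3f})")
```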

Figures

Figures reproduced from arXiv: 2604.25186 by Fangxin Shang, Qing Yang, Runze Cui, Tao Chen, Yanwu Xu, Yehui Yang.

Figure 1: Overview of FCMBench-Video. A document video is represented as a temporally ordered stack of oblique …
Figure 2: Atomic–Degradation–Composition (ADC) workflow for constructing privacy-compliant document videos.
Figure 3: Average number of documents per video across different video durations. Longer videos contain more …
Figure 4: Frame-to-frame CLIP similarity over a representative composed video.
Figure 5: Overall performance analysis on FCMBench-Video. (a) Distribution of model overall scores, where the …
Figure 6: Overall comparison of Attack Success Rate (ASR) between Visual Prompt Injection (w/o CoT) and Visual Prompt Injection (w/ CoT) on the zh-CN and en-US subsets. The figure includes all nine models used in the main trend analysis. Explicit intermediate reasoning does not guarantee lower ASR under the current visual prompt-injection construction: some models benefit noticeably, while others remain unstable or …
Figure 7: Duration-stratified perception results on the zh-CN subset. The same nine models are evaluated across 20 s, …
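
The frame-to-frame CLIP similarity curve in Figure 4 can be reproduced in outline; a minimal sketch assuming OpenCV frame sampling and the Hugging Face `openai/clip-vit-base-patch32` checkpoint, which need not match the authors' setup:

```python
import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_embeddings(path: str, every_n: int = 10) -> torch.Tensor:
    """Sample every n-th frame and return L2-normalized CLIP image features."""
    cap, frames, i = cv2.VideoCapture(path), [], 0
    ok, frame = cap.read()
    while ok:
        if i % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        ok, frame = cap.read()
        i += 1
    cap.release()
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

feats = frame_embeddings("composed_video.mp4")  # hypothetical file name
sim = (feats[:-1] * feats[1:]).sum(dim=-1)  # cosine similarity, consecutive frames
# Dips in `sim` mark document transitions within the composed video (cf. Figure 4).
```
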
original abstract

Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question–answer instances, covering 28 document types over 20 s–60 s duration tiers and 5,960 Chinese / 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FCMBench-Video, a benchmark for Video-MLLMs on document video intelligence in financial credit-review settings. It describes an atomic-acquisition and composition workflow that records 495 single-document clips, applies controlled degradations, and assembles them into 1,200 long-form videos (20–60 s) paired with 11,322 expert-annotated QA instances spanning 28 document types and two languages. Evaluations on nine recent Video-MLLMs are reported to show meaningful capability separations: counting is most duration-sensitive, Cross-Document Validation and Evidence-Grounded Selection probe higher-level integration, Visual Prompt Injection tests robustness, and the overall score distribution is broad and approximately bell-shaped.

Significance. If the constructed videos faithfully reproduce the joint distributions of real-world acquisition artifacts and temporal dynamics, FCMBench-Video would be a useful, reproducible benchmark for tracking progress on temporally grounded, evidence-integrative document understanding in authenticity-sensitive domains. The scale, privacy-compliant construction, and reported non-saturation of scores would distinguish it from existing static-document or general-video benchmarks.

major comments (2)
  1. [§3 (Construction Workflow)] The central claim that FCMBench-Video yields meaningful capability separation rests on the atomic-acquisition and composition pipeline producing videos whose degradation statistics and temporal dynamics match real capture conditions. Applying controlled degradations independently to atomic clips before assembly risks missing correlated artifacts (e.g., consistent motion blur or illumination drift across document transitions) that arise in actual handheld or scanner-based sequences; without quantitative validation against real-world joint distributions, the reported separations could reflect synthetic properties rather than genuine model differences.
  2. [§5 (Evaluation Results)] The abstract and evaluation summary assert “meaningful separation” and a “broad and approximately bell-shaped” score distribution, yet provide no quantitative details on inter-annotator agreement for the 11,322 QA instances, exact question-design criteria, or statistical significance tests (e.g., p-values or confidence intervals) for the claimed duration sensitivity of counting or the integration probed by Cross-Document Validation. These omissions undermine confidence that the separations are reliable and reproducible.
minor comments (2)
  1. [Abstract and §4] The phrase “expert-annotated” is used without reporting annotation guidelines, number of annotators, or agreement metrics; adding these would improve reproducibility.
  2. [Table 1 or equivalent] A summary table listing the nine evaluated Video-MLLMs, their parameter counts, and key per-task scores would make the separation claims easier to inspect.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the corresponding revisions to the manuscript.

point-by-point responses
  1. Referee: [§3 (Construction Workflow)] The central claim that FCMBench-Video yields meaningful capability separation rests on the atomic-acquisition and composition pipeline producing videos whose degradation statistics and temporal dynamics match real capture conditions. Applying controlled degradations independently to atomic clips before assembly risks missing correlated artifacts (e.g., consistent motion blur or illumination drift across document transitions) that arise in actual handheld or scanner-based sequences; without quantitative validation against real-world joint distributions, the reported separations could reflect synthetic properties rather than genuine model differences.

    Authors: We agree that independent degradation application does not capture all possible correlated artifacts that occur in real handheld or scanner sequences. Our pipeline prioritizes reproducibility, privacy compliance, and controlled experimentation over exhaustive replication of every joint distribution, which would be difficult to scale. The base atomic clips are captured under realistic conditions, and the degradations are derived from observed real-world artifact statistics. In the revised manuscript we have expanded §3 with an explicit discussion of the independence assumption, its rationale, and its limitations. We have also added qualitative side-by-side comparisons of composed videos versus real document videos in the supplementary material to illustrate the degree of realism achieved. revision: yes

  2. Referee: [§5 (Evaluation Results)] The abstract and evaluation summary assert “meaningful separation” and a “broad and approximately bell-shaped” score distribution, yet provide no quantitative details on inter-annotator agreement for the 11,322 QA instances, exact question-design criteria, or statistical significance tests (e.g., p-values or confidence intervals) for the claimed duration sensitivity of counting or the integration probed by Cross-Document Validation. These omissions undermine confidence that the separations are reliable and reproducible.

    Authors: We acknowledge that the original submission omitted these quantitative details. In the revised manuscript we have added a dedicated subsection in §4 that specifies the question-design criteria, taxonomy, and annotation guidelines used to create the 11,322 QA instances. We now report inter-annotator agreement statistics obtained during QA validation. In §5 we have incorporated statistical significance tests (including p-values and confidence intervals) to support the duration-sensitivity observations for counting and the capability differences observed for Cross-Document Validation and Evidence-Grounded Selection. revision: yes
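
A minimal sketch of the statistics the revision promises: Cohen's kappa for annotator agreement and a bootstrap confidence interval for a duration-tier accuracy gap; every array below is a synthetic placeholder, not the paper's data:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Placeholder double-annotation labels over a QA validation subset.
ann_a = rng.integers(0, 4, size=200)
ann_b = np.where(rng.random(200) < 0.85, ann_a, rng.integers(0, 4, size=200))
print("inter-annotator kappa:", cohen_kappa_score(ann_a, ann_b))

# Bootstrap 95% CI for the 20 s vs 60 s counting-accuracy gap.
acc20 = rng.random(300) < 0.70  # placeholder per-question correctness at 20 s
acc60 = rng.random(300) < 0.55  # placeholder per-question correctness at 60 s
diffs = [acc20[rng.integers(0, 300, 300)].mean()
         - acc60[rng.integers(0, 300, 300)].mean()
         for _ in range(2000)]
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"accuracy gap 95% CI: [{lo:.3f}, {hi:.3f}]")
```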

Circularity Check

0 steps flagged

No circularity: benchmark construction and external model evaluations are self-contained

full rationale

The paper presents a data-construction pipeline (atomic clips with controlled degradations assembled into long-form videos) and reports empirical results from running nine external Video-MLLMs on the resulting 1,200 videos and 11,322 QA pairs. No equations, parameter fitting, derivations, or predictions appear. Claims of capability separation rest on direct model outputs rather than any self-referential definition or self-citation chain. The work is therefore a standard benchmark release whose central assertions are falsifiable against independent model runs and do not reduce to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the representativeness of the synthetic construction pipeline and the reliability of expert annotations; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Controlled degradations applied to atomic clips produce videos representative of real-world capture conditions
    Invoked in the atomic-acquisition and composition workflow description.
  • domain assumption Expert annotations for the 11,322 QA instances are accurate and consistent
    Relied upon for all evaluation results without reported agreement metrics.

pith-pipeline@v0.9.0 · 5602 in / 1278 out tokens · 41488 ms · 2026-05-07T17:08:37.133436+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 10 canonical work pages · 5 internal anchors

  1. [1]

    Fcmbench: A comprehensive financial credit multimodal benchmark for real-world applications, 2026

    Yehui Yang, Dalu Yang, Wenshuo Zhou, Fangxin Shang, Yifan Liu, Jie Ren, Haojun Fei, Qing Yang, Yanwu Xu, and Tao Chen. Fcmbench: A comprehensive financial credit multimodal benchmark for real-world applications, 2026

  2. [2]

    MVBench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2311.17005

  3. [3]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, Xing Sun, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.a...

  4. [4]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

  5. [5]

    DocVQA: A dataset for VQA on document images

    Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. arXiv preprint arXiv:2007.00398, 2021. Accepted at WACV 2021

  6. [6]

    FUNSD: A dataset for form understanding in noisy scanned documents

    Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents. arXiv preprint arXiv:1905.13538, 2019

  7. [7]

    ICDAR2019 competition on scanned receipt OCR and information extraction

    Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. ICDAR2019 competition on scanned receipt OCR and information extraction. arXiv preprint arXiv:2103.10213, 2021. Related DOI: 10.1109/ICDAR.2019.00244

  8. [8]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine ...

  9. [9]

    A new era of intelligence with gemini 3, November 2025

    Google. A new era of intelligence with gemini 3, November 2025. Google official blog, published Nov 18, 2025

  10. [10]

    Video understanding, 2026

    Google. Video understanding, 2026. Gemini API documentation, accessed Apr 8, 2026

  11. [11]

    Introduction to techniques used in seed1.6, June 2025

    ByteDance Seed Team. Introduction to techniques used in seed1.6, June 2025. ByteDance Seed official blog, published Jun 25, 2025

  12. [12]

    Kimi-VL Technical Report

    Moonshot AI et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025

  13. [13]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhe Chen, Weiyun Wang, Jinguo Zhu, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  14. [14]

    Ovis2.5 technical report

    Shiyin Lu, Yang Li, Yu Xia, et al. Ovis2.5 technical report. arXiv preprint arXiv:2508.11737, 2025

  15. [15]

    Qwen3-omni, September 2025

    Qwen Team. Qwen3-omni, September 2025. Qwen official GitHub repository, released Sep 22, 2025

  16. [16]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  17. [17]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. Qwen official blog, published Feb 15, 2026