pith. machine review for the scientific record.

arxiv: 2604.25186 · v2 · submitted 2026-04-28 · 💻 cs.CV · cs.CE · cs.MM

Recognition: unknown

FCMBench-Video: Benchmarking Document Video Intelligence

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 17:08 UTC · model grok-4.3

classification 💻 cs.CV · cs.CE · cs.MM
keywords document video understanding · Video-MLLM evaluation · evidence-grounded reasoning · temporal grounding · cross-document validation · financial document analysis · authenticity verification · benchmark construction

The pith

FCMBench-Video supplies a benchmark that meaningfully separates Video-MLLMs on temporal grounding and evidence integration for document videos captured under realistic conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FCMBench-Video as a dataset and evaluation suite for document-video intelligence, where models must integrate evidence across frames while respecting acquisition cues that matter for financial verification tasks. Construction proceeds by recording reusable single-document clips, introducing controlled degradations, and composing them into longer multi-document videos paired with expert-annotated questions. Tests across nine recent Video-MLLMs demonstrate clear capability distinctions: counting accuracy declines with video length, cross-document validation and evidence-grounded selection expose integration limits, and visual prompt injection adds a robustness axis. The resulting score spread is wide and roughly bell-shaped, showing the benchmark is neither saturated nor trivial. This matters for applications that require both decision accuracy and traceable evidence in authenticity-sensitive settings.

Core claim

FCMBench-Video is assembled from 495 atomic videos into 1,200 long-form videos that contain 11,322 expert-annotated question-answer pairs spanning 28 document types, 20- to 60-second duration tiers, and balanced Chinese and English instances. When nine Video-MLLMs are evaluated, counting proves most sensitive to duration, Cross-Document Validation and Evidence-Grounded Selection surface higher-order evidence integration, Visual Prompt Injection supplies an orthogonal robustness measure, and the aggregate scores form a broad, approximately normal distribution.
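
A minimal sketch of how the "broad, approximately bell-shaped" claim could be checked once per-model overall scores are extracted; the nine values below are hypothetical placeholders, not the paper's numbers, and with n = 9 the Shapiro-Wilk test has little power either way.

```python
import numpy as np
from scipy import stats

# Hypothetical overall scores for the nine evaluated Video-MLLMs
# (placeholders, not values reported in the paper).
scores = np.array([31.2, 38.5, 42.0, 45.7, 48.3, 51.1, 54.9, 60.2, 68.4])

spread = scores.max() - scores.min()     # "broad": wide range, no ceiling effect
skewness = stats.skew(scores)            # near 0 for a symmetric distribution
w_stat, p_value = stats.shapiro(scores)  # weak normality check at n = 9

print(f"range={spread:.1f}, skew={skewness:.2f}, shapiro p={p_value:.3f}")
```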

What carries the argument

The atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with expert annotations.
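
The workflow lends itself to a compact description; a minimal sketch, where the helpers and the degradation list (`apply_degradation`, `DEGRADATIONS`) are hypothetical stand-ins, not the authors' released tooling:

```python
import random
from dataclasses import dataclass

@dataclass
class Clip:
    doc_type: str   # one of the 28 document types
    frames: list    # decoded frames of one single-document recording
    language: str   # "zh-CN" or "en-US"

# Hypothetical degradation profiles; the paper applies "controlled
# degradations" without enumerating them in this summary.
DEGRADATIONS = ["motion_blur", "low_light", "glare", "compression"]

def apply_degradation(clip: Clip, kind: str) -> Clip:
    """Placeholder: would perturb every frame with the named artifact."""
    return Clip(clip.doc_type, [f"{kind}({f})" for f in clip.frames], clip.language)

def compose(atomic_clips: list[Clip], target_seconds: int, fps: int = 30) -> list:
    """Chain degraded single-document clips into one long-form video
    until the prescribed duration tier (20-60 s) is filled."""
    video, budget = [], target_seconds * fps
    while len(video) < budget:
        clip = random.choice(atomic_clips)  # assumes clips have frames
        video.extend(apply_degradation(clip, random.choice(DEGRADATIONS)).frames)
    return video[:budget]
```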

If this is right

  • Counting accuracy degrades measurably as video duration increases from 20 s to 60 s (see the sketch after this list).
  • Cross-Document Validation and Evidence-Grounded Selection expose limits in multi-document evidence synthesis that single-frame models cannot address.
  • Visual Prompt Injection reveals a distinct failure mode orthogonal to temporal grounding.
  • The broad, bell-shaped score distribution allows future model releases to be ranked without hitting a performance ceiling.
  • The benchmark supplies a reproducible yardstick for tracking progress on document-video tasks in credit-domain applications.
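
The first bullet is directly measurable from per-question results; a minimal sketch, assuming a flat result-record schema (`model`, `task`, `tier`, `correct`) that the actual release format may not match:

```python
import pandas as pd

# Assumed per-question records; the actual release schema may differ.
results = pd.DataFrame([
    {"model": "m1", "task": "counting", "tier": 20, "correct": 1},
    {"model": "m1", "task": "counting", "tier": 60, "correct": 0},
    # ... one row per (model, question) pair
])

counting = results[results["task"] == "counting"]
print(counting.groupby("tier")["correct"].mean())
# The prediction: mean accuracy falls from the 20 s tier to the 60 s tier.
```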

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models that improve on FCMBench-Video may also reduce error rates in live financial onboarding pipelines that already rely on short video clips.
  • Adding explicit anti-fraud cues such as lighting consistency or device metadata to the annotation protocol could tighten the link between benchmark scores and real fraud detection.
  • Combining FCMBench-Video with existing static-document benchmarks would quantify the incremental value of temporal evidence streams.
  • Repeating the construction workflow with different degradation profiles could test whether current models are brittle to specific capture artifacts common in mobile submissions.

Load-bearing premise

The atomic-acquisition and composition workflow with controlled degradations and expert annotations produces videos and questions that faithfully represent real-world, authenticity-sensitive document video capture conditions.

What would settle it

A direct side-by-side comparison in which the same models are run on FCMBench-Video and on a matched set of field-collected credit-review videos; divergence in the ordering of model strengths or in the relative difficulty of counting versus integration tasks would indicate the benchmark does not capture authentic conditions.
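
One way to operationalize that comparison is a rank correlation over the shared model set; a minimal sketch with hypothetical scores, where a Spearman coefficient well below 1 would be the divergence signal described above:

```python
from scipy.stats import spearmanr

# Hypothetical overall scores for the same models on both evaluations.
bench = {"m1": 48.3, "m2": 61.0, "m3": 39.5, "m4": 55.2}  # FCMBench-Video
field = {"m1": 44.1, "m2": 58.7, "m3": 35.0, "m4": 57.9}  # field-collected videos

models = sorted(bench)
rho, p = spearmanr([bench[m] for m in models], [field[m] for m in models])
# rho near 1: the benchmark preserves the field ordering;
# rho well below 1: the benchmark fails to capture authentic conditions.
print(f"spearman rho={rho:.2f} (p={p:.3f})")
```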

Figures

Figures reproduced from arXiv: 2604.25186 by Fangxin Shang, Qing Yang, Runze Cui, Tao Chen, Yanwu Xu, Yehui Yang.

Figure 1: Overview of FCMBench-Video. A document video is represented as a temporally ordered stack of oblique …
Figure 2: Atomic–Degradation–Composition (ADC) workflow for constructing privacy-compliant document videos.
Figure 3: Average number of documents per video across different video durations. Longer videos contain more …
Figure 4: Frame-to-frame CLIP similarity over a representative composed video.
Figure 5: Overall performance analysis on FCMBench-Video. (a) Distribution of model overall scores, where the …
Figure 6: Overall comparison of Attack Success Rate (ASR) between Visual Prompt Injection (w/o CoT) and Visual Prompt Injection (w/ CoT) on the zh-CN and en-US subsets. The figure includes all nine models used in the main trend analysis. Explicit intermediate reasoning does not guarantee lower ASR under the current visual prompt-injection construction: some models benefit noticeably, while others remain unstable or …
Figure 7: Duration-stratified perception results on the zh-CN subset. The same nine models are evaluated across 20 s, …
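
The frame-to-frame CLIP similarity curve in Figure 4 can be reproduced in outline; a minimal sketch assuming OpenCV frame sampling and the Hugging Face `openai/clip-vit-base-patch32` checkpoint, which need not match the authors' setup:

```python
import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_embeddings(path: str, every_n: int = 10) -> torch.Tensor:
    """Sample every n-th frame and return L2-normalized CLIP image features."""
    cap, frames, i = cv2.VideoCapture(path), [], 0
    ok, frame = cap.read()
    while ok:
        if i % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        ok, frame = cap.read()
        i += 1
    cap.release()
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

feats = frame_embeddings("composed_video.mp4")  # hypothetical file name
sim = (feats[:-1] * feats[1:]).sum(dim=-1)  # cosine similarity, consecutive frames
# Dips in `sim` mark document transitions within the composed video (cf. Figure 4).
```
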
original abstract

Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question–answer instances, covering 28 document types over 20 s–60 s duration tiers and 5,960 Chinese / 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FCMBench-Video, a benchmark for Video-MLLMs on document video intelligence in financial credit-review settings. It describes an atomic-acquisition and composition workflow that records 495 single-document clips, applies controlled degradations, and assembles them into 1,200 long-form videos (20–60 s) paired with 11,322 expert-annotated QA instances spanning 28 document types and two languages. Evaluations on nine recent Video-MLLMs are reported to show meaningful capability separations: counting is most duration-sensitive, Cross-Document Validation and Evidence-Grounded Selection probe higher-level integration, Visual Prompt Injection tests robustness, and the overall score distribution is broad and approximately bell-shaped.

Significance. If the constructed videos faithfully reproduce the joint distributions of real-world acquisition artifacts and temporal dynamics, FCMBench-Video would be a useful, reproducible benchmark for tracking progress on temporally grounded, evidence-integrative document understanding in authenticity-sensitive domains. The scale, privacy-compliant construction, and reported non-saturation of scores would distinguish it from existing static-document or general-video benchmarks.

major comments (2)
  1. [§3 (Construction Workflow)] The central claim that FCMBench-Video yields meaningful capability separation rests on the atomic-acquisition and composition pipeline producing videos whose degradation statistics and temporal dynamics match real capture conditions. Applying controlled degradations independently to atomic clips before assembly risks missing correlated artifacts (e.g., consistent motion blur or illumination drift across document transitions) that arise in actual handheld or scanner-based sequences; without quantitative validation against real-world joint distributions, the reported separations could reflect synthetic properties rather than genuine model differences.
  2. [§5 (Evaluation Results)] The abstract and evaluation summary assert “meaningful separation” and a “broad and approximately bell-shaped” score distribution, yet provide no quantitative details on inter-annotator agreement for the 11,322 QA instances, exact question-design criteria, or statistical significance tests (e.g., p-values or confidence intervals) for the claimed duration sensitivity of counting or the integration probed by Cross-Document Validation. These omissions undermine confidence that the separations are reliable and reproducible.
minor comments (2)
  1. [Abstract and §4] The phrase “expert-annotated” is used without reporting annotation guidelines, number of annotators, or agreement metrics; adding these would improve reproducibility.
  2. [Table 1 or equivalent] A summary table listing the nine evaluated Video-MLLMs, their parameter counts, and key per-task scores would make the separation claims easier to inspect.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the corresponding revisions to the manuscript.

point-by-point responses
  1. Referee: [§3 (Construction Workflow)] The central claim that FCMBench-Video yields meaningful capability separation rests on the atomic-acquisition and composition pipeline producing videos whose degradation statistics and temporal dynamics match real capture conditions. Applying controlled degradations independently to atomic clips before assembly risks missing correlated artifacts (e.g., consistent motion blur or illumination drift across document transitions) that arise in actual handheld or scanner-based sequences; without quantitative validation against real-world joint distributions, the reported separations could reflect synthetic properties rather than genuine model differences.

    Authors: We agree that independent degradation application does not capture all possible correlated artifacts that occur in real handheld or scanner sequences. Our pipeline prioritizes reproducibility, privacy compliance, and controlled experimentation over exhaustive replication of every joint distribution, which would be difficult to scale. The base atomic clips are captured under realistic conditions, and the degradations are derived from observed real-world artifact statistics. In the revised manuscript we have expanded §3 with an explicit discussion of the independence assumption, its rationale, and its limitations. We have also added qualitative side-by-side comparisons of composed videos versus real document videos in the supplementary material to illustrate the degree of realism achieved. revision: yes

  2. Referee: [§5 (Evaluation Results)] The abstract and evaluation summary assert “meaningful separation” and a “broad and approximately bell-shaped” score distribution, yet provide no quantitative details on inter-annotator agreement for the 11,322 QA instances, exact question-design criteria, or statistical significance tests (e.g., p-values or confidence intervals) for the claimed duration sensitivity of counting or the integration probed by Cross-Document Validation. These omissions undermine confidence that the separations are reliable and reproducible.

    Authors: We acknowledge that the original submission omitted these quantitative details. In the revised manuscript we have added a dedicated subsection in §4 that specifies the question-design criteria, taxonomy, and annotation guidelines used to create the 11,322 QA instances. We now report inter-annotator agreement statistics obtained during QA validation. In §5 we have incorporated statistical significance tests (including p-values and confidence intervals) to support the duration-sensitivity observations for counting and the capability differences observed for Cross-Document Validation and Evidence-Grounded Selection. revision: yes
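
A minimal sketch of the statistics the revision promises: Cohen's kappa for annotator agreement and a bootstrap confidence interval for a duration-tier accuracy gap; every array below is a synthetic placeholder, not the paper's data:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Placeholder double-annotation labels over a QA validation subset.
ann_a = rng.integers(0, 4, size=200)
ann_b = np.where(rng.random(200) < 0.85, ann_a, rng.integers(0, 4, size=200))
print("inter-annotator kappa:", cohen_kappa_score(ann_a, ann_b))

# Bootstrap 95% CI for the 20 s vs 60 s counting-accuracy gap.
acc20 = rng.random(300) < 0.70  # placeholder per-question correctness at 20 s
acc60 = rng.random(300) < 0.55  # placeholder per-question correctness at 60 s
diffs = [acc20[rng.integers(0, 300, 300)].mean()
         - acc60[rng.integers(0, 300, 300)].mean()
         for _ in range(2000)]
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"accuracy gap 95% CI: [{lo:.3f}, {hi:.3f}]")
```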

Circularity Check

0 steps flagged

No circularity: benchmark construction and external model evaluations are self-contained

full rationale

The paper presents a data-construction pipeline (atomic clips with controlled degradations assembled into long-form videos) and reports empirical results from running nine external Video-MLLMs on the resulting 1,200 videos and 11,322 QA pairs. No equations, parameter fitting, derivations, or predictions appear. Claims of capability separation rest on direct model outputs rather than any self-referential definition or self-citation chain. The work is therefore a standard benchmark release whose central assertions are falsifiable against independent model runs and do not reduce to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the representativeness of the synthetic construction pipeline and the reliability of expert annotations; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Controlled degradations applied to atomic clips produce videos representative of real-world capture conditions
    Invoked in the atomic-acquisition and composition workflow description.
  • domain assumption Expert annotations for the 11,322 QA instances are accurate and consistent
    Relied upon for all evaluation results without reported agreement metrics.

pith-pipeline@v0.9.0 · 5602 in / 1278 out tokens · 41488 ms · 2026-05-07T17:08:37.133436+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 10 canonical work pages · 5 internal anchors

  1. [1]

    Fcmbench: A comprehensive financial credit multimodal benchmark for real-world applications, 2026

    Yehui Yang, Dalu Yang, Wenshuo Zhou, Fangxin Shang, Yifan Liu, Jie Ren, Haojun Fei, Qing Yang, Yanwu Xu, and Tao Chen. Fcmbench: A comprehensive financial credit multimodal benchmark for real-world applications, 2026

  2. [2]

    MVBench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2311.17005

  3. [3]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, Xing Sun, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.a...

  4. [4]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

  5. [5]

    DocVQA: A dataset for VQA on document images

    Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. arXiv preprint arXiv:2007.00398, 2021. Accepted at WACV 2021

  6. [6]

    FUNSD: A dataset for form understanding in noisy scanned documents

    Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents. arXiv preprint arXiv:1905.13538, 2019

  7. [7]

    ICDAR2019 competition on scanned receipt OCR and information extraction

    Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. ICDAR2019 competition on scanned receipt OCR and information extraction. arXiv preprint arXiv:2103.10213, 2021. Related DOI: 10.1109/ICDAR.2019.00244

  8. [8]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine ...

  9. [9]

    A new era of intelligence with gemini 3, November 2025

    Google. A new era of intelligence with gemini 3, November 2025. Google official blog, published Nov 18, 2025

  10. [10]

    Video understanding, 2026

    Google. Video understanding, 2026. Gemini API documentation, accessed Apr 8, 2026

  11. [11]

    Introduction to techniques used in seed1.6, June 2025

    ByteDance Seed Team. Introduction to techniques used in seed1.6, June 2025. ByteDance Seed official blog, published Jun 25, 2025

  12. [12]

    Kimi-VL Technical Report

    Moonshot AI et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025

  13. [13]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhe Chen, Weiyun Wang, Jinguo Zhu, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  14. [14]

    Ovis2.5 technical report

    Shiyin Lu, Yang Li, Yu Xia, et al. Ovis2.5 technical report. arXiv preprint arXiv:2508.11737, 2025

  15. [15]

    Qwen3-omni, September 2025

    Qwen Team. Qwen3-omni, September 2025. Qwen official GitHub repository, released Sep 22, 2025

  16. [16]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  17. [17]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. Qwen official blog, published Feb 15, 2026