FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Gueter Josmy Faure; Hung-Ting Su; Jia-Fong Yeh; Min-Hung Chen; Winston H. Hsu

arXiv: 2605.19846 · v3 · pith:QQ76WJIFnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.CL

Pith reviewed 2026-05-21 07:55 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords vision-language models · fine-grained video understanding · human activity recognition · video question answering · spatial reasoning · multi-person scenes · modular framework · long-form video benchmark

The pith

FineBench shows open vision-language models lag on fine-grained human activities in long videos while FineAgent improves their results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FineBench, a benchmark with nearly 200,000 dense multiple-choice questions across 64 long videos to test detailed comprehension of person movements, interactions, and object manipulations. Evaluations indicate that open-source VLMs perform poorly compared to proprietary models, particularly on spatial reasoning in multi-person scenes and on subtle distinctions in actions. FineAgent is presented as a modular add-on using a Localizer to pinpoint regions and a Descriptor to explain details, yielding consistent gains across several open models. A sympathetic reader would care because applications such as robotics, surveillance, and assistive systems require precisely this level of nuanced human activity understanding rather than broad scene summaries.

Core claim

FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos of about 15 minutes each, with focus on detailed person movement, person interaction, object manipulation, and compositional actions. Extensive testing reveals that proprietary models such as GPT-5 reach respectable scores while current open-source VLMs significantly underperform, especially with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. FineAgent, a modular framework built around a Localizer and a Descriptor, consistently improves the performance of various open VLMs when evaluated on FineBench.
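To make the "dense" part of this claim concrete, the headline figures can be turned into a rough annotation density. This is back-of-the-envelope arithmetic only: the QA and video counts come from the paper, while the 15-minute duration is the approximate per-video length it quotes, so the per-minute and per-second rates are estimates rather than reported statistics.

```python
# Counts quoted in the paper's abstract.
TOTAL_QA_PAIRS = 199_420
NUM_VIDEOS = 64

# Approximate duration per video ("about 15 minutes each"); an assumption
# for this estimate, since exact per-video lengths are not given here.
MINUTES_PER_VIDEO = 15

qa_per_video = TOTAL_QA_PAIRS / NUM_VIDEOS
qa_per_minute = qa_per_video / MINUTES_PER_VIDEO
qa_per_second = qa_per_minute / 60
total_hours = NUM_VIDEOS * MINUTES_PER_VIDEO / 60

print(f"QA pairs per video:  {qa_per_video:.1f}")
print(f"QA pairs per minute: {qa_per_minute:.1f}")
print(f"QA pairs per second: {qa_per_second:.2f}")
print(f"Total footage:       {total_hours:.0f} hours")
```

Under these assumptions the benchmark averages a little over three thousand questions per video, or several questions for every second of footage, which is the sense in which its coverage is dense.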

What carries the argument

FineAgent, a modular framework that augments VLMs with a Localizer to identify relevant regions and a Descriptor to generate detailed explanations, thereby targeting weaknesses in spatial and fine-grained temporal reasoning.
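The Localizer-then-Descriptor flow can be sketched as a thin wrapper around an existing VLM. This is a minimal illustration under stated assumptions, not the authors' implementation: the callable signatures, the `Region` type, the box format, the prompt layout, and the stub models are all hypothetical, chosen only to show how bounding boxes and captions might be handed to the VLM at inference time.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Region:
    """A localized region together with its textual description."""
    box: Box
    caption: str


def build_prompt(question: str, options: Sequence[str],
                 regions: Sequence[Region]) -> str:
    """Fold region boxes and captions into a multiple-choice prompt."""
    lines = ["Localized regions and descriptions:"]
    for i, region in enumerate(regions, start=1):
        x1, y1, x2, y2 = region.box
        lines.append(
            f"[{i}] box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}): {region.caption}"
        )
    lines.append(f"Question: {question}")
    # zip stops at the shorter input, so at most eight options are lettered.
    for letter, option in zip("ABCDEFGH", options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def fine_agent_answer(
    frame: object,
    question: str,
    options: Sequence[str],
    localizer: Callable[[object, str], List[Box]],
    descriptor: Callable[[object, Box], str],
    vlm: Callable[[object, str], str],
) -> str:
    """Localize, describe each region, then let the VLM answer."""
    boxes = localizer(frame, question)
    regions = [Region(box, descriptor(frame, box)) for box in boxes]
    return vlm(frame, build_prompt(question, options, regions))


# Stub models so the sketch runs without any real foundation model.
def stub_localizer(frame: object, question: str) -> List[Box]:
    return [(10.0, 20.0, 110.0, 220.0), (300.0, 40.0, 380.0, 230.0)]


def stub_descriptor(frame: object, box: Box) -> str:
    return f"person in region starting at x={box[0]:.0f}"


def stub_vlm(frame: object, prompt: str) -> str:
    # Pretend the VLM picks "B" whenever a second region is present.
    return "B" if "[2]" in prompt else "A"


demo_answer = fine_agent_answer(
    frame=None,
    question="What is the person on the right doing?",
    options=["sitting", "handing over a cup", "walking"],
    localizer=stub_localizer,
    descriptor=stub_descriptor,
    vlm=stub_vlm,
)
print(demo_answer)
```

Keeping the three components as plain callables mirrors the modular framing: any of them could be swapped for a different model without touching the others, and the base VLM needs no retraining.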

If this is right

Open VLMs can reach higher accuracy on fine-grained tasks by adding modular localization and description components without full retraining.
Spatial reasoning in multi-person scenes remains a key bottleneck that targeted modules can partially relieve.
Dense question coverage over long videos supplies a stricter standard for measuring progress in human-centric video understanding.
Future VLM design should explicitly address subtle distinctions in movements and interactions rather than relying on coarse action categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Modular enhancements similar to FineAgent could be tested on other human-centric video tasks such as emotion perception or group interaction analysis.
The focus on frame-level grounding suggests that models with stronger explicit temporal tracking might show even larger gains when paired with FineAgent.
Applying FineBench-style dense annotation to shorter clips or different cultural activity contexts could check whether the observed weaknesses are universal.
Comparing FineAgent-augmented models against end-to-end fine-tuned versions on the same benchmark would clarify the efficiency trade-offs of the modular route.

Load-bearing premise

The dense QA annotations in FineBench accurately reflect genuine fine-grained human activity understanding without systematic annotation bias or errors.

What would settle it

An independent study that re-annotates a subset of FineBench questions with new human raters and finds substantial disagreement on what counts as the correct fine-grained answer would undermine the benchmark's validity.
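One way such a re-annotation study could quantify disagreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an assumption about how the check might be run, not a procedure described in the paper; the two label lists are made-up placeholders standing in for the original answers and a second rater's answers on the same questions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater labeled independently at their own
    # marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Placeholder answer keys for eight questions (illustrative, not real data).
first_pass = ["A", "B", "A", "C", "B", "A", "D", "C"]
second_pass = ["A", "B", "A", "C", "A", "A", "D", "B"]

kappa = cohens_kappa(first_pass, second_pass)
print(f"Cohen's kappa: {kappa:.3f}")
```

A kappa near 1 on a re-annotated subset would support the load-bearing premise, while a low value on the subtle multi-person questions would point to the kind of label noise that could undermine the benchmark.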

Figures

Figures reproduced from arXiv: 2605.19846 by Gueter Josmy Faure, Hung-Ting Su, Jia-Fong Yeh, Min-Hung Chen, Winston H. Hsu.

**Figure 1.** (a) Examples of question types in FineBench which go beyond summarization to cover person posture, person-object interaction, and person-person interaction. (b) The capture of temporal evolution of interaction labels across frames, emphasizing spatial granularity (e.g., distinguish individuals in the same frame) and temporal granularity (e.g., resolving transitions between similar but distinct actions). Ab…

**Figure 2.** Distribution of Annotated Persons per Keyframe.

**Figure 3.** VLM performance analysis on FineBench detailing accuracy variations. (a) Performance degradation with increasing number…

**Figure 4.** Workflow of FineAgent. It begins with (1) prompt activation for the Localizer and Descriptor. (2) The Localizer and Descriptor, both Foundation models, provide bounding box coordinates and textual captions. (3) Finally, the VLM uses this processed information during inference.

Original abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FineBench adds a dense long-video QA set for human activities and FineAgent gives open VLMs a modular lift on spatial and subtle-action tasks, but the annotation validation details are thin.

Colleague, the main thing to know is that this paper ships FineBench, a benchmark with 199k multiple-choice QA pairs over 64 fifteen-minute videos, plus FineAgent, a two-module setup that improves open VLMs on the new questions. The videos target person movement, interactions, object handling, and compositional actions with frame-level grounding, which is denser coverage than most prior human-centric sets. Their tests show closed models like GPT-5 hold up better while open ones struggle on multi-person spatial reasoning and small movement differences, and FineAgent's Localizer plus Descriptor produces consistent gains across several open models they tried. That combination of scale and a practical fix is the concrete contribution here.

The soft spot is the missing checks on the annotations themselves. The abstract gives no inter-annotator agreement numbers or audit results, and at this density on subtle, multi-person scenes even modest label noise could distort both the measured weaknesses and the size of the FineAgent improvements. I would want to see that section before trusting how much the benchmark actually isolates genuine model deficits.

This is aimed at groups working on video-language models for robotics or surveillance who need harder human-activity tests. A reader who evaluates open VLMs or builds on existing benchmarks would find the failure modes and the proposed modules useful to discuss.

The work has enough new data and empirical results to go to a serious referee, though the review should focus on tightening the annotation validation and checking whether the gains hold under different splits. I would recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces FineBench, a human-centric VQA benchmark with 199,420 multiple-choice QA pairs densely annotated over 64 long-form (15-minute) videos, targeting fine-grained person movement, interactions, object manipulation, and compositional actions with frame-level spatial/temporal grounding. It reports that open VLMs underperform relative to proprietary models (e.g., GPT-5) especially on spatial reasoning in multi-person scenes, and proposes FineAgent—a modular Localizer+Descriptor framework—that yields consistent gains for open VLMs on FineBench.

Significance. If the benchmark faithfully measures genuine fine-grained deficits and the reported gains are robust, the work supplies a large-scale, long-form testbed that fills a gap left by existing human-centric benchmarks and offers a practical, modular enhancement route for current VLMs. The public project page and code are explicit strengths that support reproducibility and community follow-up.

major comments (2)

[Dataset construction section] No inter-annotator agreement statistics, expert audit, or bias analysis are reported for the 199k dense QA pairs. This is load-bearing for the central claim that FineAgent improves performance on genuine fine-grained deficits, because unvalidated annotations on subtle multi-person spatial relations and compositional actions could introduce systematic label noise that inflates both measured weaknesses and subsequent gains.
[Experiments section] In the evaluation of FineAgent, the claim of “consistent improvements” across open VLMs is presented without statistical significance tests, confidence intervals, or per-video variance analysis. This weakens the empirical support for the central claim that FineAgent reliably lifts performance, especially given the scale of the benchmark.

minor comments (2)

[Abstract] The reference to “GPT-5” should be clarified (current model name or version) to avoid confusion with existing GPT-4o or o1 results.
[Figures and tables] Figure captions and tables would benefit from explicit mention of the exact number of videos and QA pairs used in each reported split to improve traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on dataset validation and experimental rigor. We address each major comment below and have revised the manuscript accordingly to strengthen these aspects.

Referee: [Dataset construction section] No inter-annotator agreement statistics, expert audit, or bias analysis are reported for the 199k dense QA pairs. This is load-bearing for the central claim that FineAgent improves performance on genuine fine-grained deficits, because unvalidated annotations on subtle multi-person spatial relations and compositional actions could introduce systematic label noise that inflates both measured weaknesses and subsequent gains.

Authors: We agree that quantitative validation metrics are important for establishing annotation reliability, particularly for subtle spatial and compositional elements. The original manuscript outlined the multi-stage annotation pipeline with trained annotators and quality control, but omitted explicit inter-annotator agreement figures and bias analysis. In the revision, we have added these to the Dataset Construction section: inter-annotator agreement computed on a 5% sample, a summary of expert audit results, and a bias analysis across video sources and action categories. These additions directly address potential concerns about label noise.

Revision: yes

Referee: [Experiments section] In the evaluation of FineAgent, the claim of “consistent improvements” across open VLMs is presented without statistical significance tests, confidence intervals, or per-video variance analysis. This weakens the empirical support for the central claim that FineAgent reliably lifts performance, especially given the scale of the benchmark.

Authors: We acknowledge that formal statistical validation would provide stronger support for the reported gains. The original experiments presented average improvements without significance testing or variance details. In the revised manuscript, we have incorporated paired t-tests with p-values, 95% confidence intervals, and per-video performance variance analysis in the Experiments section. These additions confirm the consistency of FineAgent improvements while addressing the scale of the benchmark.

Revision: yes
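The kind of paired, per-video analysis requested here can be sketched with the Python standard library alone. Two caveats apply: the accuracy lists are invented placeholders rather than results from the paper, and the sketch swaps in a bootstrap confidence interval and a sign-flip permutation test in place of the paired t-test named in the rebuttal, since the standard library has no t-distribution. All function names are hypothetical.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def paired_differences(base: Sequence[float], augmented: Sequence[float]) -> List[float]:
    """Per-video accuracy gain (augmented minus base) for matched videos."""
    if len(base) != len(augmented) or len(base) < 2:
        raise ValueError("need at least two matched per-video scores")
    return [a - b for b, a in zip(base, augmented)]


def bootstrap_ci(diffs: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean per-video gain."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def sign_flip_p_value(diffs: Sequence[float], n_permutations: int = 2000,
                      seed: int = 0) -> float:
    """Two-sided permutation test: randomly flip the sign of each gain."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Placeholder per-video accuracies (illustrative only, not from the paper).
base_acc = [0.40, 0.42, 0.38, 0.45, 0.41, 0.39, 0.44, 0.43]
agent_acc = [0.43, 0.47, 0.40, 0.49, 0.47, 0.42, 0.49, 0.47]

diffs = paired_differences(base_acc, agent_acc)
ci_low, ci_high = bootstrap_ci(diffs)
p_value = sign_flip_p_value(diffs)

print(f"mean gain: {statistics.fmean(diffs):.3f}")
print(f"per-video std of gain: {statistics.stdev(diffs):.3f}")
print(f"95% bootstrap CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"sign-flip p-value: {p_value:.4f}")
```

Pairing by video matters because per-video difficulty varies widely; comparing each model against itself on the same video removes that variance from the estimate of the gain.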

Circularity Check

0 steps flagged

No circularity; empirical benchmark construction and evaluation are self-contained.

The paper constructs a new benchmark (FineBench with 199k dense QA pairs on 64 videos) and proposes an empirical enhancement framework (FineAgent using Localizer + Descriptor modules). The central claim—that FineAgent improves open VLMs on FineBench—is supported by direct experimental evaluation rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations are load-bearing for uniqueness theorems or ansatzes, and no predictions are statistically forced from subsets of the same data. The work is a standard benchmark-plus-method paper whose results stand on new annotations and held-out testing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest on the assumption that the 199,420 QA pairs were annotated consistently and that the Localizer and Descriptor modules provide independent value beyond the base VLM. No free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: Human activity understanding can be reliably measured through multiple-choice QA on densely annotated long videos.
The benchmark design and model evaluation presuppose that this format captures the intended fine-grained capabilities.

pith-pipeline@v0.9.0 · 5825 in / 1276 out tokens · 46799 ms · 2026-05-21T07:55:42.216396+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos... FineAgent integrates two key components: a Localizer... and a Descriptor...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.