FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
Pith reviewed 2026-05-21 07:55 UTC · model grok-4.3
The pith
FineBench shows open vision-language models lag on fine-grained human activities in long videos while FineAgent improves their results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos of about 15 minutes each, with focus on detailed person movement, person interaction, object manipulation, and compositional actions. Extensive testing reveals that proprietary models such as GPT-5 reach respectable scores while current open-source VLMs significantly underperform, especially with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. FineAgent, a modular framework built around a Localizer and a Descriptor, consistently improves the performance of various open VLMs when evaluated on FineBench.
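Since the benchmark is multiple-choice and organized around activity categories, model results would naturally be reported as accuracy per category. The following is a minimal sketch of such scoring, not the authors' evaluation code; the `QAItem` fields and the category strings are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One multiple-choice item. All field names here are hypothetical."""
    category: str     # e.g. "person movement" (illustrative label)
    answer: str       # gold option letter, e.g. "B"
    prediction: str   # option letter chosen by the model


def accuracy_by_category(items):
    """Return {category: fraction correct}, plus an "overall" entry."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        # Compare option letters case-insensitively, ignoring stray whitespace.
        hit = int(item.prediction.strip().upper() == item.answer.strip().upper())
        for key in (item.category, "overall"):
            correct[key] += hit
            total[key] += 1
    return {key: correct[key] / total[key] for key in total}
```

Breaking accuracy out by category is what makes a weakness such as multi-person spatial reasoning visible, since a single overall number would average it away.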
What carries the argument
FineAgent, a modular framework that augments VLMs with a Localizer to identify relevant regions and a Descriptor to generate detailed explanations, thereby targeting weaknesses in spatial and fine-grained temporal reasoning.
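One way to picture the modular idea is a thin wrapper that runs a localizer, describes each returned region, and passes those notes to an unmodified VLM along with the question. The sketch below is an assumption-laden illustration, not FineAgent itself: the source only names the two modules, so the `Region` type, the callable signatures, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = object                       # placeholder for a decoded video frame
Box = Tuple[int, int, int, int]      # assumed (x1, y1, x2, y2) pixel format


@dataclass(frozen=True)
class Region:
    frame_index: int
    box: Box


# Hypothetical interfaces; the paper's actual module APIs are not given here.
Localizer = Callable[[Sequence[Frame], str], List[Region]]
Descriptor = Callable[[Frame, Region], str]
AnswerModel = Callable[[Sequence[Frame], str], str]


def build_prompt(question: str, options: Sequence[str], notes: Sequence[str]) -> str:
    """Combine region notes, the question, and lettered options into one prompt."""
    lines = ["Region notes:"]
    lines.extend(f"- {note}" for note in notes)
    lines.append(f"Question: {question}")
    lines.extend(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)


def answer_with_modules(frames: Sequence[Frame], question: str,
                        options: Sequence[str], localize: Localizer,
                        describe: Descriptor, vlm: AnswerModel) -> str:
    """Localize, describe each region, then let the base VLM answer."""
    notes = []
    for region in localize(frames, question):
        text = describe(frames[region.frame_index], region)
        notes.append(f"frame {region.frame_index}, box {region.box}: {text}")
    return vlm(frames, build_prompt(question, options, notes))
```

Because the three components are plain callables, any open VLM could be swapped in as `vlm` without retraining, which is the property the modular framing relies on.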
If this is right
- Open VLMs can reach higher accuracy on fine-grained tasks by adding modular localization and description components without full retraining.
- Spatial reasoning in multi-person scenes remains a key bottleneck that targeted modules can partially relieve.
- Dense question coverage over long videos supplies a stricter standard for measuring progress in human-centric video understanding.
- Future VLM design should explicitly address subtle distinctions in movements and interactions rather than relying on coarse action categories.
Where Pith is reading between the lines
- Modular enhancements similar to FineAgent could be tested on other human-centric video tasks such as emotion perception or group interaction analysis.
- The focus on frame-level grounding suggests that models with stronger explicit temporal tracking might show even larger gains when paired with FineAgent.
- Applying FineBench-style dense annotation to shorter clips or different cultural activity contexts could check whether the observed weaknesses are universal.
- Comparing FineAgent-augmented models against end-to-end fine-tuned versions on the same benchmark would clarify the efficiency trade-offs of the modular route.
Load-bearing premise
The dense QA annotations in FineBench accurately reflect genuine fine-grained human activity understanding without systematic annotation bias or errors.
What would settle it
An independent study that re-annotates a subset of FineBench questions with new human raters and finds substantial disagreement on what counts as the correct fine-grained answer would undermine the benchmark's validity.
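A re-annotation check of this kind is commonly summarized with a chance-corrected agreement score. Below is a minimal sketch using Cohen's kappa between the original answer keys and a second rater's answers on the same questions; the choice of kappa and the input format (parallel lists of option letters) are assumptions, not something the paper specifies.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    labels_a and labels_b are equal-length sequences of option letters.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A low kappa on questions about subtle movements or multi-person spatial relations would be the kind of substantial disagreement described above.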
Original abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FineBench, a human-centric VQA benchmark with 199,420 multiple-choice QA pairs densely annotated over 64 long-form (15-minute) videos, targeting fine-grained person movement, interactions, object manipulation, and compositional actions with frame-level spatial/temporal grounding. It reports that open VLMs underperform relative to proprietary models (e.g., GPT-5) especially on spatial reasoning in multi-person scenes, and proposes FineAgent—a modular Localizer+Descriptor framework—that yields consistent gains for open VLMs on FineBench.
Significance. If the benchmark faithfully measures genuine fine-grained deficits and the reported gains are robust, the work supplies a large-scale, long-form testbed that fills a gap left by existing human-centric benchmarks and offers a practical, modular enhancement route for current VLMs. The public project page and code are explicit strengths that support reproducibility and community follow-up.
major comments (2)
- [Dataset construction section] Dataset construction section: no inter-annotator agreement statistics, expert audit, or bias analysis are reported for the 199k dense QA pairs. This is load-bearing for the central claim that FineAgent improves performance on genuine fine-grained deficits, because unvalidated annotations on subtle multi-person spatial relations and compositional actions could introduce systematic label noise that inflates both measured weaknesses and subsequent gains.
- [Experiments section] Experiments section (evaluation of FineAgent): the claim of “consistent improvements” across open VLMs is presented without statistical significance tests, confidence intervals, or per-video variance analysis. This weakens the empirical support for the central claim that FineAgent reliably lifts performance, especially given the scale of the benchmark.
minor comments (2)
- [Abstract] Abstract: reference to “GPT-5” should be clarified (current model name or version) to avoid confusion with existing GPT-4o or o1 results.
- [Figures and tables] Figure captions and tables would benefit from explicit mention of the exact number of videos and QA pairs used in each reported split to improve traceability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on dataset validation and experimental rigor. We address each major comment below and have revised the manuscript accordingly to strengthen these aspects.
- Referee: [Dataset construction section] Dataset construction section: no inter-annotator agreement statistics, expert audit, or bias analysis are reported for the 199k dense QA pairs. This is load-bearing for the central claim that FineAgent improves performance on genuine fine-grained deficits, because unvalidated annotations on subtle multi-person spatial relations and compositional actions could introduce systematic label noise that inflates both measured weaknesses and subsequent gains.
  Authors: We agree that quantitative validation metrics are important for establishing annotation reliability, particularly for subtle spatial and compositional elements. The original manuscript outlined the multi-stage annotation pipeline with trained annotators and quality control, but omitted explicit inter-annotator agreement figures and bias analysis. In the revision, we have added these to the Dataset Construction section: inter-annotator agreement computed on a 5% sample, a summary of expert audit results, and a bias analysis across video sources and action categories. These additions directly address potential concerns about label noise.
  Revision: yes
- Referee: [Experiments section] Experiments section (evaluation of FineAgent): the claim of “consistent improvements” across open VLMs is presented without statistical significance tests, confidence intervals, or per-video variance analysis. This weakens the empirical support for the central claim that FineAgent reliably lifts performance, especially given the scale of the benchmark.
  Authors: We acknowledge that formal statistical validation would provide stronger support for the reported gains. The original experiments presented average improvements without significance testing or variance details. In the revised manuscript, we have incorporated paired t-tests with p-values, 95% confidence intervals, and per-video performance variance analysis in the Experiments section. These additions confirm the consistency of FineAgent improvements while addressing the scale of the benchmark.
  Revision: yes
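For readers who want to run a comparable check, here is a minimal sketch of a paired analysis over per-video accuracies, using only the standard library. It computes the paired t statistic and, in place of a t-based interval, a percentile bootstrap interval for the mean gain; the bootstrap is a swapped-in technique, not necessarily what the authors used, and the function names and input format (one accuracy per video, same video order in both lists) are assumptions.

```python
import math
import random
import statistics


def paired_differences(baseline, augmented):
    """Per-video gain: augmented accuracy minus baseline accuracy."""
    if len(baseline) != len(augmented) or len(baseline) < 2:
        raise ValueError("need two equal-length lists with at least 2 videos")
    return [aug - base for base, aug in zip(baseline, augmented)]


def paired_t_statistic(baseline, augmented):
    """t statistic for the mean paired difference (n - 1 degrees of freedom)."""
    diffs = paired_differences(baseline, augmented)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def bootstrap_ci(baseline, augmented, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-video gain."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = paired_differences(baseline, augmented)
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Pairing by video matters because videos differ in difficulty; an unpaired comparison would fold that between-video variance into the noise and understate how consistent a gain is. A p-value for the t statistic would additionally require a Student's t distribution, which the standard library does not provide.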
Circularity Check
No circularity; empirical benchmark construction and evaluation are self-contained.
The paper constructs a new benchmark (FineBench with 199k dense QA pairs on 64 videos) and proposes an empirical enhancement framework (FineAgent using Localizer + Descriptor modules). The central claim—that FineAgent improves open VLMs on FineBench—is supported by direct experimental evaluation rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations are load-bearing for uniqueness theorems or ansatzes, and no predictions are statistically forced from subsets of the same data. The work is a standard benchmark-plus-method paper whose results stand on new annotations and held-out testing.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human activity understanding can be reliably measured through multiple-choice QA on densely annotated long videos.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos... FineAgent integrates two key components: a Localizer... and a Descriptor...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.