HumanMoveVQA: Can Video MLLMs reason about human movement in videos?

Adrian Hilton; Armin Mustafa; Asmar Nadeem; Faegheh Sardari; Padraig Boulton; Pulkit Gera; Valentina Bono

arxiv: 2606.27999 · v2 · pith:XYQABNKDnew · submitted 2026-06-26 · 💻 cs.CV


Pulkit Gera, Faegheh Sardari, Asmar Nadeem, Valentina Bono, Padraig Boulton, Adrian Hilton, Armin Mustafa

Pith reviewed 2026-06-29 04:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords HumanMoveVQA · video MLLMs · human motion understanding · trajectory reasoning · orientation changes · 3D motion tracks · world coordinate system · fine-tuning


The pith

Video MLLMs fail at global human trajectory and orientation reasoning but improve markedly when fine-tuned on world-consistent 3D motion data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HumanMoveVQA as a benchmark that tests whether multimodal large language models can track how people move through space over time, including changes in position and facing direction from an outside viewpoint. It anchors everything to a fixed starting frame so that motion remains consistent in a shared 3D world rather than drifting with the camera. The authors build more than ten thousand question-answer pairs across categories such as combining multiple motion segments and ordering events along a path. Tests show that leading proprietary models perform poorly on these questions, yet the same models gain substantially after training on the new data. This indicates the missing skill can be acquired when supervision respects actual geometry instead of coarse scene labels.

Core claim

HumanMoveVQA shows that current video MLLMs reduce complex human motion to broad semantic labels and cannot reliably answer questions about global trajectories or orientation shifts, but fine-tuning an open-source model on the benchmark's world-consistent 3D supervision produces clear gains across the seven reasoning categories.

What carries the argument

The multi-stage pipeline that converts 2D video frames into world-consistent 3D motion tracks anchored to the first frame, generating structured QA pairs that test trajectory-level and orientation reasoning.
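The first-frame anchoring idea can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a simplified planar pose `(x, y, yaw)` per frame rather than full 3D body poses, and the function names `wrap_angle` and `anchor_to_first_frame` are hypothetical. Each pose is re-expressed relative to the pose at frame 0, so the person starts at the origin facing along the +x axis, and later translation and rotation are measured from that fixed start.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def anchor_to_first_frame(track):
    """Re-express planar poses relative to the first pose.

    track: list of (x, y, yaw) tuples in an arbitrary world frame
           (assumed format, yaw in radians, counterclockwise positive).
    Returns a list of (x, y, yaw) in the frame of track[0], where +x is
    the initial facing direction and +y is to the person's initial left.
    """
    x0, y0, yaw0 = track[0]
    c, s = math.cos(yaw0), math.sin(yaw0)
    anchored = []
    for x, y, yaw in track:
        dx, dy = x - x0, y - y0
        # Apply the inverse (transpose) of the first-frame rotation.
        ax = c * dx + s * dy
        ay = -s * dx + c * dy
        anchored.append((ax, ay, wrap_angle(yaw - yaw0)))
    return anchored


# Illustrative data: the person faces +y, steps forward, then moves left
# while turning a further 90 degrees counterclockwise.
example = [(1.0, 1.0, math.pi / 2), (1.0, 2.0, math.pi / 2), (0.0, 2.0, math.pi)]
anchored_example = anchor_to_first_frame(example)
```

Because every pose is measured from the same starting frame, the anchored values do not depend on where the original world origin happened to sit.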

If this is right

Fine-tuned open-source models can outperform proprietary ones on global human motion tasks when given the same world-consistent supervision.
Video understanding systems can move beyond local joint or scene labels to handle trajectory aggregation and sequential ordering questions.
The seven reasoning categories supply a structured way to measure progress on movement-aware video models.
A geometric, first-frame-anchored coordinate system provides a repeatable foundation for generating motion QA data at scale.
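To make the last point concrete, here is a hedged sketch of how one multiple-choice item might be derived from an anchored heading sequence. The question wording, the three options, the 45-degree threshold, and the function names are all assumptions for illustration; the paper's actual templates and categories are not reproduced here. The example aggregates frame-to-frame heading changes into a net turn, in the spirit of a motion aggregation question.

```python
import math


def net_turn_degrees(yaws):
    """Sum wrapped frame-to-frame heading changes.

    yaws: list of headings in radians (counterclockwise positive).
    Returns the net turn in degrees; positive means a turn to the left.
    """
    total = 0.0
    for prev, curr in zip(yaws, yaws[1:]):
        # Wrap each step to [-pi, pi) so crossing +/-pi is not a huge jump.
        total += (curr - prev + math.pi) % (2 * math.pi) - math.pi
    return math.degrees(total)


def make_turn_question(yaws, threshold_deg=45.0):
    """Build a hypothetical multiple-choice QA item from a heading track."""
    turn = net_turn_degrees(yaws)
    if turn > threshold_deg:
        answer = "left"
    elif turn < -threshold_deg:
        answer = "right"
    else:
        answer = "neither"
    return {
        "question": "Over the whole clip, which way does the person end up turning?",
        "options": ["left", "right", "neither"],
        "answer": answer,
        "net_turn_deg": turn,
    }


# Illustrative input: two 45-degree counterclockwise steps.
item = make_turn_question([0.0, math.pi / 4, math.pi / 2])
```

Because the answer is computed from geometry rather than from a caption, any error in the underlying track flows straight into the label, which is why the accuracy of the lifting step matters.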

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lifting pipeline could be applied to non-human moving objects such as vehicles or animals to create similar benchmarks.
Models trained this way may transfer better to downstream tasks that require predicting future paths from observed motion.
The gap between proprietary and fine-tuned performance suggests other video reasoning shortfalls could also close with targeted geometric data rather than scale alone.

Load-bearing premise

The pipeline that lifts 2D observations into 3D motion tracks keeps translation and rotation accurate relative to the fixed starting point without introducing errors that would invalidate the generated questions and answers.

What would settle it

A direct comparison of the pipeline's 3D tracks against ground-truth motion capture data on the same videos that reveals systematic drift in position or rotation, or a re-run of the fine-tuning experiment on a fresh set of real videos that shows no accuracy gain.
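A comparison of this kind could be scripted along the following lines. This is only a sketch under stated assumptions: both tracks are taken to be frame-aligned lists of planar `(x, y, yaw)` poses already anchored to the first frame, and the metric names are hypothetical rather than those of any published protocol.

```python
import math


def drift_report(estimated, ground_truth):
    """Compare an estimated track with a reference track.

    Both arguments: equal-length lists of (x, y, yaw) poses, assumed to be
    first-frame anchored and frame-aligned (positions in metres, yaw in radians).
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("tracks must be non-empty and of equal length")

    position_errors = []
    heading_errors = []
    for (ex, ey, eyaw), (gx, gy, gyaw) in zip(estimated, ground_truth):
        position_errors.append(math.hypot(ex - gx, ey - gy))
        # Wrap the heading difference so 359 deg vs 1 deg counts as 2 deg.
        diff = (eyaw - gyaw + math.pi) % (2 * math.pi) - math.pi
        heading_errors.append(abs(diff))

    return {
        "mean_position_error": sum(position_errors) / len(position_errors),
        "final_position_error": position_errors[-1],
        "max_heading_error_deg": math.degrees(max(heading_errors)),
    }


# Made-up numbers purely to exercise the function.
estimated_track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.3, 0.1)]
reference_track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
report = drift_report(estimated_track, reference_track)
```

A final error that grows with clip length while early-frame error stays small would be one sign of the systematic drift described above.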

Figures

Figures reproduced from arXiv: 2606.27999 by Adrian Hilton, Armin Mustafa, Asmar Nadeem, Faegheh Sardari, Padraig Boulton, Pulkit Gera, Valentina Bono.

**Figure 2.** Overview of pipeline generating HumanMoveVQA. Given an input video, we use…

**Figure 3.** Motion characteristics across the three full datasets before train/test splitting.

**Figure 4.** Qualitative results on the EMDB dataset. The world-view depicts extracted 3D SMPL-X poses and is shown for illustration only (not provided to the models). Green denotes the correct option. We compare predictions from multiple MLLMs across seven reasoning categories, where our model demonstrates more accurate and consistent responses. More results in…

**Figure 5.** Heatmap visualizations of training strategies and cross-dataset generalization.

**Figure 6.** Qualitative results on the EMDB dataset. The world-view visualization depicts extracted 3D SMPL-X poses and is shown for illustration only (not provided to the models). Green denotes the correct option. We compare predictions from multiple MLLMs across seven reasoning categories, where our model demonstrates more accurate and consistent responses.

**Figure 7.** Qualitative results on the RICH dataset. The world-view visualization depicts extracted 3D SMPL-X poses and is shown for illustration only (not provided to the models). Green denotes the correct option. We compare predictions from multiple MLLMs across seven reasoning categories, where our model demonstrates more accurate and consistent responses.

**Figure 8.** Qualitative results on the EgoBody dataset. The world-view visualization depicts extracted 3D SMPL-X poses and is shown for illustration only (not provided to the models). Green denotes the correct option. We compare predictions from multiple MLLMs across seven reasoning categories, where our model demonstrates more accurate and consistent responses.

**Figure 9.** Performance comparison across reasoning categories on the HumanMoveVQA benchmark.

**Figure 10.** Effect of Frame Count. Qwen3-VL 8B is fine-tuned (SFT) with varying numbers of input…

**Figure 11.** Effect of input resolution. Qwen3-VL 8B is fine-tuned (SFT) at different input resolutions…

**Figure 12.** System prompt used for evaluation of models. We use the same system prompt for training…

**Figure 13.** Prompt used for clothing caption generation.

**Figure 14.** Examples of category-wise reasoning traces for a sample EMDB video. For each category, …

Original abstract

Despite the rapid advance of Multimodal Large Language Models (MLLMs) in high-level video understanding, a fundamental bottleneck remains: these models collapse complex human motion into coarse semantic labels. Existing benchmarks mostly focus on scene-centric events or local joint articulations, failing to probe global human motion in space over time (trajectory and orientation changes). We introduce HumanMoveVQA, the first comprehensive benchmark designed to evaluate global trajectory and orientation reasoning from an exocentric perspective. Our benchmark utilizes a first-frame anchored world coordinate system, preserving translation and rotation relative to a fixed starting point. We propose a scalable, multi-stage pipeline that lifts 2D video observations into world-consistent 3D motion tracks to generate over 10K structured question-answer pairs across seven reasoning categories, including motion aggregation, sequential ordering, and trajectory-level inference. Our extensive evaluation reveals a critical capability gap in state-of-the-art proprietary models on deep human motion understanding. However, we demonstrate that this is a learnable problem; by fine-tuning an open-source baseline with our targeted, world-consistent supervision, we achieve a significant improvement. HumanMoveVQA establishes a rigorous geometric foundation for developing next-generation, movement-aware video understanding models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New benchmark targets global human motion reasoning but the 2D-to-3D lifting step has no reported validation against ground truth.


The paper puts forward HumanMoveVQA as the first benchmark focused on exocentric global trajectory and orientation changes for humans in video, using a first-frame anchored world coordinate system and seven reasoning categories. It generates over 10K QA pairs via a multi-stage 2D-to-3D lifting pipeline and shows that current proprietary MLLMs perform poorly while fine-tuning an open model on their data yields gains.

What is actually new is the explicit targeting of world-consistent global motion rather than local joints or scene events. The setup distinguishes itself from prior benchmarks by preserving translation and rotation relative to a fixed start point, which matches needs in robotics and surveillance.

The work does a reasonable job identifying the capability gap and demonstrating that targeted supervision can close some of it. The abstract is clear on the motivation and the high-level construction.

The main soft spot is the lifting pipeline itself. No error metrics, drift measurements, or held-out comparisons to mocap or other 3D ground truth appear in the provided text. If the 3D tracks contain systematic translation or rotation errors, the QA pairs and the fine-tuning signal become unreliable, which directly affects the central claim about a learnable gap. This is not a minor detail; it is the foundation for everything downstream.

The paper is aimed at groups working on video MLLMs that need geometric motion understanding. A reader already building or evaluating such models would get value from the task definitions and the fine-tuning result, provided the data quality checks out.

It deserves a serious referee to examine the pipeline validation and dataset statistics in the full manuscript. I would send it for review rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The paper introduces HumanMoveVQA, the first benchmark targeting global human trajectory and orientation reasoning in videos from an exocentric perspective. It describes a first-frame-anchored world coordinate system and a multi-stage 2D-to-3D lifting pipeline used to generate over 10K QA pairs across seven categories (motion aggregation, sequential ordering, trajectory inference, etc.). Evaluations on the benchmark reveal capability gaps in state-of-the-art proprietary video MLLMs, while fine-tuning an open-source baseline with the generated world-consistent supervision yields significant gains.
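Evaluations of this kind are typically reported as multiple-choice accuracy per reasoning category. The sketch below shows one way such a breakdown could be computed; the record fields (`category`, `answer`, `prediction`) and the sample records are assumptions for illustration and are not taken from the paper's evaluation code or results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute per-category accuracy for multiple-choice predictions.

    records: iterable of dicts with keys 'category', 'answer', 'prediction'
             (a hypothetical schema). Matching ignores case and outer spaces.
    Returns a dict mapping category name to accuracy in [0, 1].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Made-up records, not benchmark results.
sample_records = [
    {"category": "motion aggregation", "answer": "B", "prediction": "b"},
    {"category": "motion aggregation", "answer": "A", "prediction": "C"},
    {"category": "sequential ordering", "answer": "D", "prediction": " D "},
]
scores = accuracy_by_category(sample_records)
```

Reporting scores per category, rather than one pooled number, makes it visible whether a fine-tuned model improves uniformly or only on some kinds of question.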

Significance. If the 3D lifting pipeline is shown to be accurate, the work supplies a geometrically grounded benchmark that addresses a clear gap in existing video-understanding evaluations, which largely emphasize semantic events or local joint motion rather than global trajectory and orientation changes over time. The explicit demonstration that the observed gap is learnable via targeted supervision is a constructive finding for the field.

major comments (1)

[Method section describing the 2D-to-3D lifting pipeline] The multi-stage 2D-to-3D lifting pipeline (described in the method section) is load-bearing for the entire benchmark and all downstream claims, yet the manuscript reports no quantitative validation against ground-truth 3D data (e.g., position/orientation drift or mocap error on held-out sequences). Systematic biases in translation or rotation relative to the first-frame anchor would directly corrupt the seven reasoning categories and the fine-tuning supervision.

minor comments (1)

[Abstract] The abstract states results at a high level but omits any reference to dataset statistics, error analysis, or validation of the generated QA pairs, making it difficult to assess reliability from the provided description alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the benchmark's significance. We address the single major comment point by point below.


Referee: [Method section describing the 2D-to-3D lifting pipeline] The multi-stage 2D-to-3D lifting pipeline (described in the method section) is load-bearing for the entire benchmark and all downstream claims, yet the manuscript reports no quantitative validation against ground-truth 3D data (e.g., position/orientation drift or mocap error on held-out sequences). Systematic biases in translation or rotation relative to the first-frame anchor would directly corrupt the seven reasoning categories and the fine-tuning supervision.

Authors: We agree that the 2D-to-3D lifting pipeline is central to the benchmark and that the absence of quantitative validation against ground-truth 3D data is a limitation. The pipeline composes established off-the-shelf components (2D pose estimation, monocular depth, and camera pose estimation) with a first-frame anchoring step, but we did not report end-to-end error metrics on held-out mocap sequences. In the revised manuscript we will add a dedicated validation subsection that measures position and orientation drift on sequences with available 3D ground truth, thereby quantifying any systematic biases relative to the anchor frame.

revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark pipeline and evaluations are independent of each other


The paper constructs HumanMoveVQA via a described multi-stage 2D-to-3D lifting pipeline to produce QA pairs, then separately evaluates proprietary and open-source MLLMs on those pairs and shows fine-tuning gains. No equations, fitted parameters, or self-citations are presented as load-bearing derivations. The pipeline is an input method for data generation, not a quantity derived from or equivalent to the model results. The central claims (capability gap + learnability) rest on external model evaluations rather than reducing to the pipeline definition itself. This matches the default expectation of a non-circular benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the 3D lifting pipeline is mentioned but not detailed enough to extract any.

pith-pipeline@v0.9.1-grok · 5764 in / 1128 out tokens · 42602 ms · 2026-06-29T04:50:47.155236+00:00 · methodology
