Perception Test 2025: Challenge Summary and a Unified VQA Extension

Andrew Zisserman; Aravindh Mahendran; Dima Damen; João Carreira; Joseph Heyward; Nikhil Parthasarathy; Tyler Zhu; Viorica Pătrăucean

arxiv: 2601.06287 · v2 · submitted 2026-01-09 · 💻 cs.CV

Joseph Heyward, Nikhil Parthasarathy, Tyler Zhu, Aravindh Mahendran, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean

Pith reviewed 2026-05-16 15:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords Perception Test · video QA · task unification · multimodal models · object tracking · action localisation · video understanding · benchmark

The pith

Perception Test 2025 unifies video perception tasks into single interfaces to expose limits in current multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports on the third Perception Test challenge at ICCV 2025 and its focus on task unification across five consolidated tracks. These include unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA. The unification merges prior separate tasks and requires competitors to submit single approaches instead of separate pipelines for each problem. A sympathetic reader cares because the setup tests whether existing video-language models can handle diverse perception problems without task-specific engineering. The report also describes a new subset that turns point tracking and temporal action localisation into multiple-choice video QA questions.

Core claim

By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces. The challenge required competitors to use unified approaches rather than engineered pipelines with task-specific models, and the unified video QA track introduced a novel subset that reformulates traditional perception tasks as multiple-choice video QA questions that video-language models can natively tackle.

What carries the argument

The unified challenge tracks that merge original tasks such as object tracking with point tracking and temporal action localisation with sound localisation, while reformulating them as native video QA problems for multimodal models.
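As a rough illustration of what such a reformulation could look like, the sketch below turns one temporal action localisation annotation into a multiple-choice question about when the action happens. It is not the organisers' construction: the `ActionSegment` type, the `make_localisation_mcq` helper, the question wording, the distractor sampling and the 0.1 overlap threshold are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class ActionSegment:
    """One annotated action (hypothetical schema, not the benchmark's)."""
    label: str
    start_s: float
    end_s: float


def temporal_iou(a, b):
    """Intersection over union of two (start, end) windows in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def make_localisation_mcq(segment, video_duration_s, num_options=4, seed=0):
    """Build a multiple-choice question whose options are time windows."""
    rng = random.Random(seed)
    length = segment.end_s - segment.start_s
    if length <= 0 or length > video_duration_s:
        raise ValueError("segment must be non-empty and fit inside the video")

    correct = (round(segment.start_s, 1), round(segment.end_s, 1))
    options = [correct]

    # Assumed distractor scheme: windows of the same length placed elsewhere,
    # rejected if they duplicate an option or overlap the true window too much.
    for _ in range(1000):
        if len(options) == num_options:
            break
        start = round(rng.uniform(0.0, video_duration_s - length), 1)
        candidate = (start, round(start + length, 1))
        if candidate not in options and temporal_iou(candidate, correct) <= 0.1:
            options.append(candidate)
    if len(options) < num_options:
        raise ValueError("video too short to sample enough distractors")

    rng.shuffle(options)
    return {
        "question": f"When does the action '{segment.label}' take place?",
        "options": [f"{s:.1f}s to {e:.1f}s" for s, e in options],
        "answer_index": options.index(correct),
    }


if __name__ == "__main__":
    q = make_localisation_mcq(ActionSegment("pouring water", 4.0, 6.0), 30.0)
    print(q["question"])
    for i, option in enumerate(q["options"]):
        print(f"  ({i}) {option}")
    print("answer:", q["answer_index"])
```

In this sketch the model only has to pick an option index, so scoring reduces to plain accuracy, which is what lets a video-language model answer without a dedicated localisation head. How the actual benchmark samples distractors or phrases its questions is not specified in the text above.
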

If this is right

Video-language models must solve point tracking and action localisation natively as multiple-choice questions rather than with dedicated components.
Models are now evaluated on their ability to handle hour-long videos and grounded QA within the same unified interface.
The merged tracking and localisation tracks test whether one architecture can replace separate object and sound modules.
The open analysis and interpretability track invites submissions that explain model behavior on these unified problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future benchmarks may adopt similar unification to discourage overfitting to isolated tasks.
This format could encourage architectures that maintain consistent internal representations across tracking, localisation, and question answering.
Extending the reformulation trick to additional perception problems might reveal which tasks are easiest to express as video QA.

Load-bearing premise

That forcing competitors to use unified approaches instead of engineered pipelines with task-specific models yields a more meaningful evaluation of current multimodal models.

What would settle it

A single model or approach that achieves strong results across all unified tracks without any task-specific modules or separate pipelines would contradict the reported difficulties.

Original abstract

The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure the progress in multimodal perception. This year, the workshop featured 2 guest tracks as well: KiVA (an image understanding challenge) and Physic-IQ (a video generation challenge). In this report, we summarise the results from the main Perception Test challenge, detailing both the existing tasks as well as novel additions to the benchmark. In this iteration, we placed an emphasis on task unification, as this poses a more challenging test for current SOTA multimodal models. The challenge included five consolidated tracks: unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA, alongside an analysis and interpretability track that is still open for submissions. Notably, the unified video QA track introduced a novel subset that reformulates traditional perception tasks (such as point tracking and temporal action localisation) as multiple-choice video QA questions that video-language models can natively tackle. The unified object and point tracking merged the original object tracking and point tracking tasks, whereas the unified action and sound localisation merged the original temporal action localisation and temporal sound localisation tracks. Accordingly, we required competitors to use unified approaches rather than engineered pipelines with task-specific models. By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a clear challenge summary that unifies several video perception tasks and adds a VQA reformulation, but the claim about unified interfaces exposing extra difficulty lacks direct comparisons.

The main takeaway is that this report updates the Perception Test by merging tracks into unified versions and introducing a new subset that turns tasks like point tracking and action localisation into multiple-choice video QA questions. That reformulation is the freshest part, since it lets video-language models handle them natively without custom heads. The paper lays out the five consolidated tracks, the guest challenges, and the workshop outcomes in straightforward terms, which makes the structure easy to follow. Requiring single unified models instead of task-specific pipelines is a reasonable push toward generality, and the description of where models struggled is factual and useful for tracking progress. The soft spot is the assertion that unified interfaces highlight significant difficulties. The report states this but shows no side-by-side results comparing the same models on unified versus separate task formulations, nor any numbers against prior Perception Test editions that allowed engineered pipelines. Without those, low scores could just reflect task difficulty rather than the cost of unification. This kind of paper is mainly for researchers working on multimodal video models who want a snapshot of current benchmark results and the new QA framing. A reader building generalist systems would get practical value from the track details and observed failure modes. It deserves a serious referee because benchmark summaries like this document field progress, even if they would be stronger with the missing baseline comparisons added. I would recommend sending it to peer review.

Referee Report

1 major / 3 minor

Summary. The manuscript summarizes the third Perception Test challenge organized as a full-day workshop at ICCV 2025. It describes the main benchmark structure along with two guest tracks (KiVA and Physic-IQ), then details five consolidated tracks that emphasize task unification: unified video QA (including a novel subset reformulating point tracking and temporal action localisation as multiple-choice VQA questions), unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA. The report notes that competitors were required to submit unified models rather than task-specific pipelines and states that the results illustrate significant difficulties current multimodal models encounter when handling diverse perception tasks through a single interface.

Significance. As a factual summary of a large-scale multimodal benchmark, the report is useful for documenting current SOTA limitations under unified evaluation protocols. The reformulation of perception tasks into native VQA format and the open analysis track are constructive additions that could encourage more integrated model development. Significance is limited, however, because the manuscript provides no quantitative head-to-head comparisons between unified and non-unified formulations, leaving the central interpretive claim about unification-induced difficulty unsupported by direct evidence.

major comments (1)

[Abstract] Abstract (final paragraph) and the description of the unified tracks: the claim that the unified challenge 'highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces' is not backed by any direct comparison. The text states that competitors were required to use unified approaches and reports observed low performance, yet contains no results from the same architectures run on the original separate tracks, no ablation of unification cost, and no metrics against prior Perception Test editions that permitted task-specific models. Without such controls it is impossible to attribute difficulties to the unified interface rather than to inherent task hardness.
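The control the referee asks for can be stated concretely. The sketch below computes a per-task "unification gap" between a task-specific score and a unified score for the same architecture. The `unification_gap` helper and the task names are hypothetical, and no such paired scores are reported in the text reviewed here, so any numbers passed in would have to come from new experiments.

```python
def unification_gap(unified, task_specific):
    """Per-task score drop when moving from a task-specific to a unified setup.

    Both arguments map a task name to a score on the same metric and split.
    Positive values mean the unified model scored lower on that task.
    """
    shared = sorted(set(unified) & set(task_specific))
    if not shared:
        raise ValueError("no task appears in both result sets")
    return {task: task_specific[task] - unified[task] for task in shared}


if __name__ == "__main__":
    # Placeholder values only, NOT results from the challenge.
    unified = {"point_tracking": 0.5, "action_localisation": 0.25}
    separate = {"point_tracking": 0.75, "action_localisation": 0.25}
    print(unification_gap(unified, separate))
```

A gap near zero on every task would suggest that low scores reflect task hardness; a consistently positive gap would support attributing the difficulty to the unified interface.
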

minor comments (3)

[Results] The results section would be strengthened by reporting concrete quantitative scores (accuracy, mAP, or ranking tables) for the top submissions in each of the five tracks rather than qualitative statements about 'difficulties'.
Add the number of participating teams and total submissions per track to give readers context on the scale and competitiveness of the 2025 edition.
The distinction between the 'unified video QA' track and the 'grounded video QA' track is not clearly delineated in the provided text; a short table or bullet list contrasting their input/output formats and evaluation protocols would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our summary of the Perception Test 2025 challenge. We address the major comment below and will incorporate appropriate revisions.

Referee: [Abstract] Abstract (final paragraph) and the description of the unified tracks: the claim that the unified challenge 'highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces' is not backed by any direct comparison. The text states that competitors were required to use unified approaches and reports observed low performance, yet contains no results from the same architectures run on the original separate tracks, no ablation of unification cost, and no metrics against prior Perception Test editions that permitted task-specific models. Without such controls it is impossible to attribute difficulties to the unified interface rather than to inherent task hardness.

Authors: We agree that the manuscript lacks direct head-to-head comparisons between unified and task-specific models, as well as quantitative benchmarks against prior Perception Test editions that allowed separate pipelines. The 2025 challenge was deliberately structured to mandate unified models in order to evaluate integrated multimodal perception capabilities across tasks. The reported low performance levels therefore reflect results obtained under this unified protocol. However, we acknowledge that without ablations or cross-edition controls it is not possible to isolate the contribution of the unified interface from inherent task difficulty. In the revised version we will tone down the interpretive claim in the abstract and unified-tracks section to describe the observed difficulties under the required unified setting, add an explicit limitations paragraph noting the absence of such controls, and suggest comparative experiments as valuable future work.

Revision: partial

Circularity Check

0 steps flagged

No circularity: factual challenge summary with no derivations or self-referential loops

The paper is a descriptive report summarizing the Perception Test 2025 workshop structure, task consolidations (e.g., unified video QA, merged tracking and localisation tracks), and the requirement for unified approaches. No equations, fitted parameters, predictions, or derivation chains exist. The claim that the unified setup 'highlights the significant difficulties existing models face' is an observational statement about the challenge design and outcomes, not a result derived from or reducing to prior self-citations, ansatzes, or inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The report remains self-contained as an external factual summary.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a benchmark challenge summary report and introduces no mathematical models, free parameters, axioms, or new entities.

pith-pipeline@v0.9.0 · 5621 in / 951 out tokens · 44131 ms · 2026-05-16T15:35:04.016844+00:00 · methodology
