Perception Test 2025: Challenge Summary and a Unified VQA Extension
Pith reviewed 2026-05-16 15:35 UTC · model grok-4.3
The pith
Perception Test 2025 unifies video perception tasks into single interfaces to expose limits in current multimodal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces. The challenge required competitors to use unified approaches rather than engineered pipelines with task-specific models, and the unified video QA track introduced a novel subset that reformulates traditional perception tasks as multiple-choice video QA questions that video-language models can natively tackle.
What carries the argument
The unified challenge tracks, which merge original tasks (object tracking with point tracking, and temporal action localisation with temporal sound localisation), together with a unified video QA subset that reformulates traditional perception tasks as multiple-choice questions that video-language models can tackle natively.
If this is right
- Video-language models must solve point tracking and action localisation natively as multiple-choice questions rather than with dedicated components.
- Models are also evaluated on grounded video QA and hour-long video QA, which run as separate tracks within the same challenge.
- The merged tracks test whether one architecture can replace separate object-tracking and point-tracking models, or separate action-localisation and sound-localisation models.
- The open analysis and interpretability track invites submissions that explain model behavior on these unified problems.
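As a rough illustration of what reformulating a perception task as multiple-choice video QA could look like, the sketch below turns one temporal action localisation annotation into a question whose options are time windows. This is not the organisers' construction: the `ActionSegment` type, the `make_mcq` helper, the question wording, and the random distractor sampling are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class ActionSegment:
    """Hypothetical annotation: one labelled action with start/end times in seconds."""
    label: str
    start_s: float
    end_s: float


def overlaps(a, b):
    """True if two (start, end) windows share any time."""
    return a[0] < b[1] and b[0] < a[1]


def make_mcq(segment, video_duration_s, num_options=3, seed=0):
    """Build a multiple-choice question asking when the action occurs.

    Distractors are same-length windows that do not overlap the true
    window or each other (an assumed sampling scheme, not the paper's).
    """
    length = segment.end_s - segment.start_s
    if length <= 0 or length > video_duration_s:
        raise ValueError("segment must have positive length and fit in the video")

    rng = random.Random(seed)
    correct = (segment.start_s, segment.end_s)
    windows = [correct]
    for _ in range(1000):  # bounded rejection sampling
        if len(windows) == num_options:
            break
        start = round(rng.uniform(0, video_duration_s - length), 1)
        candidate = (start, round(start + length, 1))
        if not any(overlaps(candidate, w) for w in windows):
            windows.append(candidate)
    if len(windows) < num_options:
        raise ValueError("could not sample enough non-overlapping distractors")

    rng.shuffle(windows)
    return {
        "question": f"When does the action '{segment.label}' occur?",
        "options": [f"{s:.1f}s to {e:.1f}s" for s, e in windows],
        "answer_index": windows.index(correct),
    }


if __name__ == "__main__":
    seg = ActionSegment("pouring water", 12.0, 15.5)
    print(make_mcq(seg, video_duration_s=60.0))
```

A video-language model would then only need to return an option index, so the task can be scored with plain accuracy instead of a localisation metric. How the actual benchmark picks distractors and phrases questions is not specified in the text above.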
Where Pith is reading between the lines
- Future benchmarks may adopt similar unification to discourage overfitting to isolated tasks.
- This format could encourage architectures that maintain consistent internal representations across tracking, localisation, and question answering.
- Extending the reformulation trick to additional perception problems might reveal which tasks are easiest to express as video QA.
Load-bearing premise
That forcing competitors to use unified approaches instead of engineered pipelines with task-specific models yields a more meaningful evaluation of current multimodal models.
What would settle it
A single model or approach that achieves strong results across all unified tracks without any task-specific modules or separate pipelines would contradict the reported difficulties.
Original abstract
The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure the progress in multimodal perception. This year, the workshop featured 2 guest tracks as well: KiVA (an image understanding challenge) and Physic-IQ (a video generation challenge). In this report, we summarise the results from the main Perception Test challenge, detailing both the existing tasks as well as novel additions to the benchmark. In this iteration, we placed an emphasis on task unification, as this poses a more challenging test for current SOTA multimodal models. The challenge included five consolidated tracks: unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA, alongside an analysis and interpretability track that is still open for submissions. Notably, the unified video QA track introduced a novel subset that reformulates traditional perception tasks (such as point tracking and temporal action localisation) as multiple-choice video QA questions that video-language models can natively tackle. The unified object and point tracking merged the original object tracking and point tracking tasks, whereas the unified action and sound localisation merged the original temporal action localisation and temporal sound localisation tracks. Accordingly, we required competitors to use unified approaches rather than engineered pipelines with task-specific models. By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript summarizes the third Perception Test challenge organized as a full-day workshop at ICCV 2025. It describes the main benchmark structure along with two guest tracks (KiVA and Physic-IQ), then details five consolidated tracks that emphasize task unification: unified video QA (including a novel subset reformulating point tracking and temporal action localisation as multiple-choice VQA questions), unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA. The report notes that competitors were required to submit unified models rather than task-specific pipelines and states that the results illustrate significant difficulties current multimodal models encounter when handling diverse perception tasks through a single interface.
Significance. As a factual summary of a large-scale multimodal benchmark, the report is useful for documenting current SOTA limitations under unified evaluation protocols. The reformulation of perception tasks into native VQA format and the open analysis track are constructive additions that could encourage more integrated model development. Significance is limited, however, because the manuscript provides no quantitative head-to-head comparisons between unified and non-unified formulations, leaving the central interpretive claim about unification-induced difficulty unsupported by direct evidence.
major comments (1)
- [Abstract] Abstract (final paragraph) and the description of the unified tracks: the claim that the unified challenge 'highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces' is not backed by any direct comparison. The text states that competitors were required to use unified approaches and reports observed low performance, yet contains no results from the same architectures run on the original separate tracks, no ablation of unification cost, and no metrics against prior Perception Test editions that permitted task-specific models. Without such controls it is impossible to attribute difficulties to the unified interface rather than to inherent task hardness.
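The control the referee asks for can be pictured as a per-task difference between scores obtained with task-specific models and scores obtained with a unified model. The sketch below is only an illustration of that bookkeeping: the `unification_gap` helper, the task names, and the numbers are hypothetical placeholders, not results from the challenge, and it assumes both score sets use the same metric on the same evaluation split.

```python
def unification_gap(unified, task_specific):
    """Return task_specific minus unified score for every task present in both.

    Both arguments map a task name to a score on the same metric and split
    (an assumption of this sketch). A positive value means the task-specific
    setup scored higher on that task.
    """
    shared = sorted(set(unified) & set(task_specific))
    return {task: task_specific[task] - unified[task] for task in shared}


if __name__ == "__main__":
    # Placeholder values for illustration only; not reported results.
    unified_scores = {"task_a": 0.5, "task_b": 0.25}
    specific_scores = {"task_a": 0.75, "task_b": 0.25, "task_c": 0.9}
    print(unification_gap(unified_scores, specific_scores))
```

Only tasks scored under both settings are compared, which is why `task_c` is dropped in the usage example. Such a gap would be interpretable only if the compared systems are otherwise matched, for example the same architecture trained under both protocols.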
minor comments (3)
- [Results] The results section would be strengthened by reporting concrete quantitative scores (accuracy, mAP, or ranking tables) for the top submissions in each of the five tracks rather than qualitative statements about 'difficulties'.
- Add the number of participating teams and total submissions per track to give readers context on the scale and competitiveness of the 2025 edition.
- The distinction between the 'unified video QA' track and the 'grounded video QA' track is not clearly delineated in the provided text; a short table or bullet list contrasting their input/output formats and evaluation protocols would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our summary of the Perception Test 2025 challenge. We address the major comment below and will incorporate appropriate revisions.
- Referee: [Abstract] Abstract (final paragraph) and the description of the unified tracks: the claim that the unified challenge 'highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces' is not backed by any direct comparison. The text states that competitors were required to use unified approaches and reports observed low performance, yet contains no results from the same architectures run on the original separate tracks, no ablation of unification cost, and no metrics against prior Perception Test editions that permitted task-specific models. Without such controls it is impossible to attribute difficulties to the unified interface rather than to inherent task hardness.
Authors: We agree that the manuscript lacks direct head-to-head comparisons between unified and task-specific models, as well as quantitative benchmarks against prior Perception Test editions that allowed separate pipelines. The 2025 challenge was deliberately structured to mandate unified models in order to evaluate integrated multimodal perception capabilities across tasks. The reported low performance levels therefore reflect results obtained under this unified protocol. However, we acknowledge that without ablations or cross-edition controls it is not possible to isolate the contribution of the unified interface from inherent task difficulty. In the revised version we will tone down the interpretive claim in the abstract and unified-tracks section to describe the observed difficulties under the required unified setting, add an explicit limitations paragraph noting the absence of such controls, and suggest comparative experiments as valuable future work.
Revision: partial
Circularity Check
No circularity: factual challenge summary with no derivations or self-referential loops
The paper is a descriptive report summarizing the Perception Test 2025 workshop structure, task consolidations (e.g., unified video QA, merged tracking and localisation tracks), and the requirement for unified approaches. No equations, fitted parameters, predictions, or derivation chains exist. The claim that the unified setup 'highlights the significant difficulties existing models face' is an observational statement about the challenge design and outcomes, not a result derived from or reducing to prior self-citations, ansatzes, or inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The report remains self-contained as an external factual summary.