arxiv: 2605.01496 · v1 · submitted 2026-05-02 · 💻 cs.CV

SF20K Competition 2025: Summary and findings

Pith reviewed 2026-05-09 14:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords long-form video QA · story-level video understanding · short films · video question answering · multimodal reasoning · information selection · benchmark competition

The pith

The SF20K competition shows that information selection and reasoning structure, not raw model capacity, limit long-form video question answering, with top systems at 65.7 percent versus 91.7 percent for humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from the first SF20K Competition, which tests story-level video understanding via open-ended questions on 95 amateur short films containing 979 question-answer pairs. Twenty-two teams submitted entries, with the winner reaching 65.7 percent accuracy on the unrestricted main track and 48.7 percent on the track limited to models under 8 billion parameters. Analysis of submissions finds that narrative-aware processing at the shot level outperforms uniform frame sampling, that carefully designed multi-stage pipelines allow smaller models to match or exceed much larger ones, and that subtitle quality strongly influences outcomes. These patterns indicate the core difficulty lies in selecting relevant details and organizing reasoning about narratives rather than in overall model scale, leaving a clear gap to human-level performance.

Core claim

In the SF20K-Test benchmark of amateur short films and open-ended questions, leading methods rely on narrative-aware shot-level video processing together with multi-stage pipelines; this allows models below 8 billion parameters to approach the performance of models over 30 times larger, yet all entries remain well below the human ceiling of 91.7 percent, showing that the dominant limits are in information selection and reasoning structure.

What carries the argument

The SF20K-Test benchmark of 95 short films and 979 questions, scored by the LLM-QA-Eval judge, which forces reliance on multimodal story comprehension instead of memorization of popular content.
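As a rough illustration of how the headline number is aggregated, the Figure 2 caption says the judge assigns each answer a 1-5 score and a 0-1 correctness label, and that the final metric is the average correctness over all samples. The sketch below implements only that averaging step. The `Judgement` type and function name are hypothetical, and the judge prompt and the rule that maps an answer to a correctness label are not given in the source, so they are not modelled here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Judgement:
    """One judge verdict for a (question, predicted answer, ground truth) triple."""
    score: int    # 1-5 quality score assigned by the judge
    correct: int  # 0-1 correctness label assigned by the judge


def llm_qa_eval_accuracy(judgements: list[Judgement]) -> float:
    """Return accuracy in percent: the mean of the 0-1 correctness labels."""
    if not judgements:
        raise ValueError("need at least one judgement")
    for j in judgements:
        if j.correct not in (0, 1) or not 1 <= j.score <= 5:
            raise ValueError(f"invalid judgement: {j}")
    return 100.0 * mean(j.correct for j in judgements)


# Hypothetical verdicts for four questions (not real competition data).
toy = [Judgement(5, 1), Judgement(2, 0), Judgement(4, 1), Judgement(1, 0)]
print(llm_qa_eval_accuracy(toy))  # 50.0
```

Note that only the binary label enters the reported accuracy; the 1-5 score is carried along but does not affect the average in this sketch.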

If this is right

  • Narrative-aware shot-level processing outperforms uniform frame sampling for long videos.
  • Multi-stage pipelines with smaller models can equal or beat end-to-end inference on models over 30 times larger.
  • Subtitle quality is a dominant factor in overall performance on story-level questions.
  • The main bottleneck is information selection and reasoning structure rather than model size.
  • A substantial gap persists between current methods and human-level narrative comprehension.
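To make the first point concrete, here is a minimal sketch contrasting uniform frame sampling with one frame per shot. It is not any team's method: the source does not describe how participants segmented shots, so the shot start indices are assumed to come from some external shot detector, and both function names are hypothetical.

```python
def uniform_sample(num_frames: int, k: int) -> list[int]:
    """Pick k frame indices at evenly spaced positions across the whole video."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the centre of each of the k equal-length segments.
    return [int(step * i + step / 2) for i in range(k)]


def shot_level_sample(shot_starts: list[int], num_frames: int) -> list[int]:
    """Pick the middle frame of every shot, given sorted shot start indices."""
    bounds = list(shot_starts) + [num_frames]
    return [(start + end) // 2
            for start, end in zip(bounds, bounds[1:]) if end > start]


# Hypothetical 1000-frame video with two short shots followed by two long ones.
shots = [0, 40, 100, 700]
print(uniform_sample(1000, 4))         # [125, 375, 625, 875]
print(shot_level_sample(shots, 1000))  # [20, 70, 400, 850]
```

With the same budget of four frames, the uniform sample never lands in the two short opening shots, while the shot-level sample covers every shot once. That is one plausible reading of why shot-aware selection helps on narrative questions.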

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future systems may benefit more from improved mechanisms for selecting and summarizing relevant shots than from further increases in parameter count.
  • The benchmark's use of amateur films reduces the risk of models relying on pre-trained movie knowledge, making it useful for testing genuine story understanding.
  • Integrating stronger subtitle processing or explicit narrative tracking modules could narrow the observed performance gap.
  • The results suggest testing whether hybrid approaches that combine small selection models with larger reasoning models can close more of the gap to humans.
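The hybrid idea in the last bullet can be sketched as a two-stage pipeline: a cheap selector narrows the context, then a separate reasoner answers from the selected lines only. Everything below is an illustrative assumption. The word-overlap selector is a stand-in for a small selection model, `reasoner` is any callable the reader supplies, and none of the names correspond to a competition entry.

```python
import re
from typing import Callable


def _words(text: str) -> set[str]:
    """Lowercased word set with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def select_context(question: str, subtitles: list[str], top_k: int = 2) -> list[str]:
    """Stage 1 (stand-in selector): rank subtitle lines by word overlap with the question."""
    q_words = _words(question)
    scored = [(len(q_words & _words(line)), i, line)
              for i, line in enumerate(subtitles)]
    # Highest overlap first; ties broken by original order.
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:top_k]
    # Restore chronological order so the reasoner sees the story in sequence.
    return [line for _, _, line in sorted(best, key=lambda t: t[1])]


def answer(question: str, subtitles: list[str],
           reasoner: Callable[[str], str], top_k: int = 2) -> str:
    """Stage 2: hand only the selected context to a reasoning model."""
    context = "\n".join(select_context(question, subtitles, top_k))
    return reasoner(f"Context:\n{context}\n\nQuestion: {question}")


# Hypothetical subtitles; the echo reasoner just returns the prompt it was given.
subs = ["I lost the key yesterday.", "The weather is nice.",
        "Anna found the key in the garden.", "Let us eat."]
print(answer("Who found the key?", subs, reasoner=lambda prompt: prompt))
```

Keeping the selector and the reasoner behind separate interfaces is the design point: either stage can be swapped for a stronger model without touching the other.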

Load-bearing premise

The automated LLM-QA-Eval judge based on GPT-4.1-nano gives a reliable and unbiased measure of answer quality that matches human judgment on open-ended video questions.

What would settle it

A direct comparison of LLM-QA-Eval scores against human ratings on the same set of model answers from the competition submissions.
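One way such a comparison could be summarised, assuming both the judge and a human rater give a 0-1 correctness label for each answer, is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a plain-Python implementation for binary labels; the label lists are made-up placeholders, not competition data.

```python
import math


def cohen_kappa(judge: list[int], human: list[int]) -> float:
    """Cohen's kappa for two raters giving binary (0/1) labels to the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n  # fraction the judge marked correct
    p_human = sum(human) / n  # fraction the human marked correct
    # Chance agreement: both say 1, or both say 0, independently.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return math.nan  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight model answers.
judge_labels = [1, 1, 0, 1, 0, 0, 1, 1]
human_labels = [1, 0, 0, 1, 0, 1, 1, 1]
print(round(cohen_kappa(judge_labels, human_labels), 3))  # 0.467
```

In this toy case the raters agree on 6 of 8 items (75 percent), yet kappa is only about 0.47 because both mark most answers correct, so much of the agreement is expected by chance.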

Figures

Figures reproduced from arXiv: 2605.01496 by Ivan Laptev, Ridouane Ghermi, Vicky Kalogeiton, Xi Wang.

Figure 1: A sample from the SF20K competition, including the movie title, a sentence describing the story, a few frames with their corresponding captions, a question and a ground-truth answer. In this example, we display the specific timestamps that make it possible to answer the question.
Figure 2: The LLM-QA-Eval metric. Given a question, the metric uses an LLM (i.e., GPT-4.1-nano) to compare a predicted answer to the ground-truth answer, assigning a 1-5 score and a 0-1 correctness label. The final metric is the average correctness across all N = 979 samples.
Figure 3: Ablation study. Qwen2.5-VL-7B’s performance depending on the input modality (left) and number of input frames (right).
Figure 4: Teams’ performance on the public leaderboard over the course of the competition.

Leaderboard table extracted with this figure:

Track    Rank  Team Name   Public Acc. (%)  Private Acc. (%)
Main     1     WXYZ        49.0             65.7
Main     2     sudook      53.1             65.3
Main     3     BASELINE    40.8             59.8
Main     4     R&T         55.9             58.5
Main     5     eg          46.6             53.5
Special  1     WXYZ        48.3             48.7
Special  2     eloral-wxy  47.2             46.2
Special  3     sudook      39.2             44.8
Special  4     R&T         46.4             42.4
Special  5     eg          45.5             -

Abstract

This report presents the results and findings of the first edition of the Short-Films 20K (SF20K) Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, introducing an open-ended video question-answering task built on a corpus of amateur short films. This setup ensures that models must rely on multimodal understanding rather than memorization of popular movies. Evaluation is conducted using the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and scored via LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition attracted 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% accuracy on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%. Our analysis reveals several key findings: narrative-aware, shot-level processing consistently outperforms uniform frame sampling; well-designed multi-stage pipelines using smaller models can match or exceed end-to-end inference with models over 30x larger; and subtitle quality is a dominant factor in performance. These results highlight that the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity, and that a substantial gap remains between current methods and human-level narrative comprehension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This paper summarizes the SF20K Competition 2025 results for open-ended long-form video QA on a corpus of 95 amateur short films (SF20K-Test: 979 QA pairs). It reports 22 teams and 286 submissions across a Main Track (unrestricted model size, winner 65.7%) and Special Track (models <8B parameters, winner 48.7%), with a human ceiling of 91.7%. Evaluation uses the LLM-QA-Eval automated judge based on GPT-4.1-nano. Key observational findings are that narrative-aware shot-level processing outperforms uniform sampling, multi-stage pipelines with smaller models can match or exceed much larger end-to-end models, subtitle quality is a dominant factor, the primary bottleneck is information selection and reasoning structure rather than raw capacity, and a substantial gap remains to human-level narrative comprehension.

Significance. If the LLM-QA-Eval judge is shown to align with human judgments, the results would be significant for video understanding research by providing empirical evidence from a controlled competition that targeted processing strategies and reasoning pipelines can be more impactful than model scale alone, while establishing a new benchmark that avoids memorization of popular media. The competition format and participation numbers also offer a useful snapshot of current method capabilities on narrative video QA.

major comments (2)
  1. [Abstract] The central claims that 'the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity' and that narrative-aware shot-level processing is superior rest entirely on performance rankings and comparisons produced by the GPT-4.1-nano-based LLM-QA-Eval judge. No correlation study, human validation, or inter-rater agreement metrics are reported for this judge on the 979 SF20K-Test questions, so the observed patterns (including the 65.7%/48.7% scores and the gap to the 91.7% human ceiling) cannot be confidently interpreted as reflecting true capabilities rather than judge-specific biases in answer style or content.
  2. [Abstract] The human performance ceiling of 91.7% is stated without any description of the evaluation protocol used for humans (e.g., whether humans answered directly and were scored by the same LLM judge, or were assessed by other humans), which is load-bearing for interpreting the claimed gap between current methods and human-level narrative comprehension.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly noted the total number of submissions per track and any controls applied to prevent data leakage or memorization beyond the use of amateur films.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and support for the claims.

  1. Referee: [Abstract] The central claims that 'the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity' and that narrative-aware shot-level processing is superior rest entirely on performance rankings and comparisons produced by the GPT-4.1-nano-based LLM-QA-Eval judge. No correlation study, human validation, or inter-rater agreement metrics are reported for this judge on the 979 SF20K-Test questions, so the observed patterns (including the 65.7%/48.7% scores and the gap to the 91.7% human ceiling) cannot be confidently interpreted as reflecting true capabilities rather than judge-specific biases in answer style or content.

    Authors: We agree that the lack of a dedicated correlation study between LLM-QA-Eval and human judgments on the SF20K-Test set weakens the evidential basis for interpreting the observed performance patterns and bottlenecks. The judge follows the design from our prior related work, where it demonstrated reasonable alignment, but no such analysis was performed or reported specifically for the 979 questions in this competition summary. In the revised manuscript, we will expand the methods section to include additional details on the judge's prompting strategy, any available prior validation metrics, and a discussion of potential biases. However, conducting a new full correlation or inter-rater study on the complete test set would require fresh human annotations that are not currently available. revision: partial

  2. Referee: [Abstract] The human performance ceiling of 91.7% is stated without any description of the evaluation protocol used for humans (e.g., whether humans answered directly and were scored by the same LLM judge, or were assessed by other humans), which is load-bearing for interpreting the claimed gap between current methods and human-level narrative comprehension.

    Authors: We appreciate this observation and agree that the protocol must be described for proper interpretation of the human-model gap. Human performance was measured by having participants watch the full videos and provide open-ended answers, which were then scored using the identical LLM-QA-Eval judge to maintain consistency with model evaluation. We will revise the abstract to include a brief mention of this protocol and add a dedicated paragraph in the main text detailing the human evaluation setup, including instructions to annotators, number of participants, and computation of the 91.7% score. revision: yes

standing simulated objections not resolved
  • A full correlation study, human validation, and inter-rater agreement metrics for the LLM-QA-Eval judge on the complete set of 979 SF20K-Test questions

Circularity Check

0 steps flagged

Purely empirical competition report with no derivations or self-referential reductions

The paper is a summary of competition results on the SF20K-Test benchmark, reporting submission scores, track winners, and observational findings from 22 teams. No equations, parameter fits, or derivation chains are present. LLM-QA-Eval is introduced as an external automated judge without any internal derivation or self-definition that reduces to the reported outcomes. All claims (e.g., narrative-aware processing outperforming uniform sampling, bottleneck in information selection) are direct empirical observations from the submissions and do not reduce to prior quantities by construction. This is self-contained reporting against external benchmarks with no load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical competition summary report. It introduces no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.0 · 5579 in / 1105 out tokens · 34694 ms · 2026-05-09T14:26:32.932411+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

  1. Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-VL technical report. arXiv, 2025.
  2. Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-VL technical report. arXiv, 2025.
  3. Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv, 2025.
  4. Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, and Ivan Laptev. Long story short: Story-level video understanding from 20K short films. arXiv, 2025.
  5. GLM-V Team. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv, 2025.
  6. Kwai Keye Team. Kwai Keye-VL 1.5 technical report. arXiv, 2025.
  7. OpenAI. Introducing ChatGPT, 2022.
  8. OpenAI. Introducing GPT-4.1 in the API, 2025.
  9. OpenAI. Introducing GPT-5, 2025.
  10. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. ACL, 2002.
  11. Qwen Team. Qwen2.5 technical report. arXiv, 2025.
  12. Qwen Team. Qwen3 technical report. arXiv, 2025.
  13. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. PMLR, 2023.
  14. Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nikolay Karpov, Jagadeesh Balam, and Boris Ginsburg. Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and high-performance models for multilingual ASR and AST. arXiv, 2025.
  15. Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. CVPR, 2015.
  16. Weiyun Wang, Zhangwei Gao, Lixin Gu, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv, 2025.