SF20K Competition 2025: Summary and findings
Pith reviewed 2026-05-09 14:26 UTC · model grok-4.3
The pith
The SF20K competition shows that information selection and reasoning structure, not raw model capacity, limit long-form video question answering, with top systems at 65.7 percent versus 91.7 percent for humans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the SF20K-Test benchmark of amateur short films and open-ended questions, leading methods combine narrative-aware, shot-level video processing with multi-stage pipelines. This lets models below 8 billion parameters approach the performance of models over 30 times larger. All entries nonetheless remain well below the human ceiling of 91.7 percent, which points to information selection and reasoning structure as the dominant limits.
What carries the argument
The SF20K-Test benchmark of 95 short films and 979 questions, scored by the LLM-QA-Eval judge, which forces reliance on multimodal story comprehension instead of memorization of popular content.
If this is right
- Narrative-aware shot-level processing outperforms uniform frame sampling for long videos.
- Multi-stage pipelines with smaller models can equal or beat end-to-end inference on models over 30 times larger.
- Subtitle quality is a dominant factor in overall performance on story-level questions.
- The main bottleneck is information selection and reasoning structure rather than model size.
- A substantial gap persists between current methods and human-level narrative comprehension.
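The contrast between uniform frame sampling and shot-level processing in the first point can be illustrated with a small sketch. This is not any team's method: the shot boundaries are hypothetical, and the function names `uniform_sample`, `shot_level_sample`, and `shots_covered` are made up for illustration. It only shows why a fixed frame budget spread evenly over time can miss short shots that a per-shot strategy always covers.

```python
from typing import List, Set, Tuple

Shot = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices evenly spaced over the whole video."""
    if k <= 0 or num_frames <= 0:
        return []
    step = num_frames / k
    return [int(step * i + step / 2) for i in range(k)]


def shot_level_sample(shots: List[Shot]) -> List[int]:
    """Pick the middle frame of every shot."""
    return [(start + end) // 2 for start, end in shots]


def shots_covered(frames: List[int], shots: List[Shot]) -> Set[int]:
    """Return indices of shots that contain at least one sampled frame."""
    return {
        i
        for i, (start, end) in enumerate(shots)
        if any(start <= f < end for f in frames)
    }


# Hypothetical video: one long shot followed by three short ones.
shots = [(0, 900), (900, 930), (930, 960), (960, 1000)]

uniform = uniform_sample(num_frames=1000, k=4)
per_shot = shot_level_sample(shots)

print("uniform frames:", uniform, "covers shots", shots_covered(uniform, shots))
print("per-shot frames:", per_shot, "covers shots", shots_covered(per_shot, shots))
```

With the same budget of four frames, the uniform sampler in this toy case spends every frame on the long opening shot, while the per-shot sampler sees each shot once. Real systems would obtain shot boundaries from a shot detector rather than a hard-coded list.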
Where Pith is reading between the lines
- Future systems may benefit more from improved mechanisms for selecting and summarizing relevant shots than from further increases in parameter count.
- The benchmark's use of amateur films reduces the risk of models relying on pre-trained movie knowledge, making it useful for testing genuine story understanding.
- Integrating stronger subtitle processing or explicit narrative tracking modules could narrow the observed performance gap.
- The results suggest testing whether hybrid approaches that combine small selection models with larger reasoning models can close more of the gap to humans.
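The hybrid idea in the last point, a cheap selector feeding a stronger reasoner, might be structured roughly as follows. This is a sketch under stated assumptions, not a description of any competition entry: the `Shot` type, the word-overlap selector, and the `reasoner` callable are all hypothetical stand-ins. In a real system the selector would be a small model and the reasoner a larger one.

```python
from typing import Callable, List, NamedTuple, Set


class Shot(NamedTuple):
    index: int
    subtitle: str  # subtitle text aligned to this shot (may be empty)


def words(text: str) -> Set[str]:
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def select_shots(question: str, shots: List[Shot], top_k: int) -> List[Shot]:
    """Stage 1 (stand-in selector): rank shots by word overlap with the question."""
    q_words = words(question)
    ranked = sorted(
        shots, key=lambda s: (-len(q_words & words(s.subtitle)), s.index)
    )
    # Restore narrative order so the reasoner sees events in sequence.
    return sorted(ranked[:top_k], key=lambda s: s.index)


def answer(
    question: str,
    shots: List[Shot],
    reasoner: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Stage 2: build a compact prompt from the selected shots and call the reasoner."""
    selected = select_shots(question, shots, top_k)
    context = "\n".join(f"[shot {s.index}] {s.subtitle}" for s in selected)
    return reasoner(f"{context}\nQuestion: {question}")


# Hypothetical film with four shots.
shots = [
    Shot(0, "Good morning, Anna."),
    Shot(1, "The letter is gone."),
    Shot(2, "Nice weather today."),
    Shot(3, "Mark stole it."),
]
question = "Who stole the letter?"


def echo(prompt: str) -> str:
    """Stub reasoner that returns its prompt, to inspect what stage 1 passed on."""
    return prompt


print(answer(question, shots, echo))
```

Keeping the reasoner behind a plain callable makes it easy to swap models and to test the selection stage in isolation, which is where the report locates the main bottleneck.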
Load-bearing premise
The automated LLM-QA-Eval judge based on GPT-4.1-nano gives a reliable and unbiased measure of answer quality that matches human judgment on open-ended video questions.
What would settle it
A direct comparison of LLM-QA-Eval scores against human ratings on the same set of model answers from the competition submissions.
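One way such a comparison could be summarized is with raw agreement and Cohen's kappa between judge and human verdicts on the same answers. The sketch below assumes binary correct/incorrect labels, which the source does not confirm for LLM-QA-Eval, and uses made-up labels purely for illustration.

```python
from typing import Sequence


def agreement(judge: Sequence[int], human: Sequence[int]) -> float:
    """Fraction of items on which the two raters give the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[int], human: Sequence[int]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    p_observed = agreement(judge, human)
    n = len(judge)
    labels = set(judge) | set(human)
    p_chance = sum(
        (list(judge).count(label) / n) * (list(human).count(label) / n)
        for label in labels
    )
    if p_chance == 1.0:  # both raters always use the same single label
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Made-up verdicts (1 = answer accepted, 0 = rejected) on ten model answers.
judge_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
human_labels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

print("raw agreement:", agreement(judge_labels, human_labels))
print("Cohen's kappa:", round(cohens_kappa(judge_labels, human_labels), 3))
```

Kappa is useful here because raw agreement can look high simply because most answers fall in one class; a study on the actual submissions would also want per-team breakdowns to expose any judge bias toward particular answer styles.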
Original abstract
This report presents the results and findings of the first edition of the Short-Films 20K (SF20K) Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, introducing an open-ended video question-answering task built on a corpus of amateur short films. This setup ensures that models must rely on multimodal understanding rather than memorization of popular movies. Evaluation is conducted using the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and scored via LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition attracted 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% accuracy on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%. Our analysis reveals several key findings: narrative-aware, shot-level processing consistently outperforms uniform frame sampling; well-designed multi-stage pipelines using smaller models can match or exceed end-to-end inference with models over 30x larger; and subtitle quality is a dominant factor in performance. These results highlight that the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity, and that a substantial gap remains between current methods and human-level narrative comprehension.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper summarizes the SF20K Competition 2025 results for open-ended long-form video QA on a corpus of 95 amateur short films (SF20K-Test: 979 QA pairs). It reports 22 teams and 286 submissions across a Main Track (unrestricted model size, winner 65.7%) and Special Track (models <8B parameters, winner 48.7%), with a human ceiling of 91.7%. Evaluation uses the LLM-QA-Eval automated judge based on GPT-4.1-nano. Key observational findings are that narrative-aware shot-level processing outperforms uniform sampling, multi-stage pipelines with smaller models can match or exceed much larger end-to-end models, subtitle quality is a dominant factor, the primary bottleneck is information selection and reasoning structure rather than raw capacity, and a substantial gap remains to human-level narrative comprehension.
Significance. If the LLM-QA-Eval judge is shown to align with human judgments, the results would be significant for video understanding research by providing empirical evidence from a controlled competition that targeted processing strategies and reasoning pipelines can be more impactful than model scale alone, while establishing a new benchmark that avoids memorization of popular media. The competition format and participation numbers also offer a useful snapshot of current method capabilities on narrative video QA.
major comments (2)
- [Abstract] Abstract: The central claims that 'the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity' and that narrative-aware shot-level processing is superior rest entirely on performance rankings and comparisons produced by the GPT-4.1-nano-based LLM-QA-Eval judge. No correlation study, human validation, or inter-rater agreement metrics are reported for this judge on the 979 SF20K-Test questions, so the observed patterns (including the 65.7%/48.7% scores and the gap to the 91.7% human ceiling) cannot be confidently interpreted as reflecting true capabilities rather than judge-specific biases in answer style or content.
- [Abstract] Abstract: The human performance ceiling of 91.7% is stated without any description of the evaluation protocol used for humans (e.g., whether humans answered directly and were scored by the same LLM judge, or were assessed by other humans), which is load-bearing for interpreting the claimed gap between current methods and human-level narrative comprehension.
minor comments (1)
- [Abstract] The abstract would be clearer if it briefly noted the total number of submissions per track and any controls applied to prevent data leakage or memorization beyond the use of amateur films.
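As a rough aid for reading the reported scores, separate from the judge-validity concern raised above, the sampling uncertainty implied by 979 questions can be estimated with a normal-approximation interval. This sketch assumes each question is scored as simply correct or incorrect and that questions are independent; neither is confirmed by the source, and the human score may have been measured on a different number of questions.

```python
import math
from typing import Tuple


def accuracy_interval(acc: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval for a proportion measured on n items."""
    if not 0.0 <= acc <= 1.0 or n <= 0:
        raise ValueError("acc must be in [0, 1] and n must be positive")
    standard_error = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * standard_error, acc + z * standard_error


N_QUESTIONS = 979  # size of SF20K-Test as stated in the abstract

reported = [
    ("Main Track winner", 0.657),
    ("Special Track winner", 0.487),
    ("Human ceiling", 0.917),  # assumption: same n as the model scores
]

for name, acc in reported:
    low, high = accuracy_interval(acc, N_QUESTIONS)
    print(f"{name}: {acc:.1%} (approx. 95% interval {low:.1%} to {high:.1%})")
```

Under these assumptions the intervals are a few points wide, so the gaps between the two track winners and the human ceiling are much larger than sampling noise alone. That says nothing about systematic judge bias, which an interval of this kind cannot detect.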
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and support for the claims.
- Referee: [Abstract] Abstract: The central claims that 'the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity' and that narrative-aware shot-level processing is superior rest entirely on performance rankings and comparisons produced by the GPT-4.1-nano-based LLM-QA-Eval judge. No correlation study, human validation, or inter-rater agreement metrics are reported for this judge on the 979 SF20K-Test questions, so the observed patterns (including the 65.7%/48.7% scores and the gap to the 91.7% human ceiling) cannot be confidently interpreted as reflecting true capabilities rather than judge-specific biases in answer style or content.
Authors: We agree that the lack of a dedicated correlation study between LLM-QA-Eval and human judgments on the SF20K-Test set weakens the evidential basis for interpreting the observed performance patterns and bottlenecks. The judge follows the design from our prior related work, where it demonstrated reasonable alignment, but no such analysis was performed or reported specifically for the 979 questions in this competition summary. In the revised manuscript, we will expand the methods section to include additional details on the judge's prompting strategy, any available prior validation metrics, and a discussion of potential biases. However, conducting a new full correlation or inter-rater study on the complete test set would require fresh human annotations that are not currently available.
Revision: partial
- Referee: [Abstract] Abstract: The human performance ceiling of 91.7% is stated without any description of the evaluation protocol used for humans (e.g., whether humans answered directly and were scored by the same LLM judge, or were assessed by other humans), which is load-bearing for interpreting the claimed gap between current methods and human-level narrative comprehension.
Authors: We appreciate this observation and agree that the protocol must be described for proper interpretation of the human-model gap. Human performance was measured by having participants watch the full videos and provide open-ended answers, which were then scored using the identical LLM-QA-Eval judge to maintain consistency with model evaluation. We will revise the abstract to include a brief mention of this protocol and add a dedicated paragraph in the main text detailing the human evaluation setup, including instructions to annotators, number of participants, and computation of the 91.7% score.
Revision: yes
- Not addressed by the planned revisions: a full correlation study, human validation, and inter-rater agreement metrics for the LLM-QA-Eval judge on the complete set of 979 SF20K-Test questions.
Circularity Check
Purely empirical competition report with no derivations or self-referential reductions
The paper is a summary of competition results on the SF20K-Test benchmark, reporting submission scores, track winners, and observational findings from 22 teams. No equations, parameter fits, or derivation chains are present. LLM-QA-Eval is introduced as an external automated judge without any internal derivation or self-definition that reduces to the reported outcomes. All claims (e.g., narrative-aware processing outperforming uniform sampling, bottleneck in information selection) are direct empirical observations from the submissions and do not reduce to prior quantities by construction. This is self-contained reporting against external benchmarks with no load-bearing self-citations or ansatzes.