LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Pith reviewed 2026-05-18 00:25 UTC · model grok-4.3
The pith
LongVideoBench tests long-context video understanding with referring reasoning on videos up to an hour long.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongVideoBench supplies 3,763 videos and 6,678 questions that frame the core problem as accurate retrieval and reasoning over detailed multimodal information from long interleaved inputs, using a referring-reasoning task in which each question points to a referred context that the model must then analyze.
What carries the argument
Referring reasoning, the task in which a question contains a referring query that points to related video contexts called the referred context, forcing the model to locate and reason over the relevant details from that context.
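To make the task shape concrete, here is a minimal sketch of how one referring-reasoning item could be represented. The class name, field names, and the example question are hypothetical illustrations, not the benchmark's actual schema; the five-letter option limit is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

LETTERS = "ABCDE"  # assumption: at most five options per question


@dataclass
class ReferringQuestion:
    """Hypothetical container for one referring-reasoning item."""
    video_id: str
    referring_query: str          # points at the referred context
    question: str                 # what to reason about inside that context
    options: List[str]
    answer_index: int
    referred_spans: List[Tuple[float, float]]  # (start_s, end_s), illustrative

    def prompt(self) -> str:
        # Referring query first, then the question, then lettered options.
        lines = [f"{self.referring_query} {self.question}"]
        lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(self.options)]
        return "\n".join(lines)

    def is_correct(self, predicted_letter: str) -> bool:
        letter = predicted_letter.strip().upper()
        return len(letter) == 1 and LETTERS.find(letter) == self.answer_index


example = ReferringQuestion(
    video_id="demo",
    referring_query="When the chef places the steak in the pan,",
    question="what color is the pan?",
    options=["Red", "Black", "Silver", "Blue"],
    answer_index=1,
    referred_spans=[(312.0, 318.5)],
)
```

Separating the referring query from the question keeps the retrieval step (find the referred context) distinct from the reasoning step (answer about it), which mirrors the two-stage framing in the task definition.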
If this is right
- Proprietary models such as GPT-4o, Gemini-1.5-Pro and GPT-4-Turbo still encounter substantial difficulties on hour-long video inputs.
- Open-source models display an even wider performance gap than their proprietary counterparts.
- Benchmark scores rise measurably only when models gain the ability to process additional frames.
Where Pith is reading between the lines
- Developers could use the benchmark to measure progress toward systems that retain fine detail across extended video sequences without proportional increases in compute.
- The design may encourage new architectures that better fuse subtitle text with visual content over long time spans.
- Similar referring-reasoning formats could be adapted to test long-context understanding in other modalities such as audio or document streams.
Load-bearing premise
The human-annotated questions and video selection process accurately capture long-term multimodal understanding without significant curation biases or gaps in coverage of real-world scenarios.
What would settle it
A model that processes only a small number of frames yet matches or exceeds the accuracy of models that ingest many more frames on the full set of 6,678 questions would undermine the reported link between frame capacity and benchmark performance.
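The check described here can be sketched as a small script, assuming per-question results are available as (frame count, predicted option, gold option) records. The record layout and function names are illustrative assumptions, not part of the benchmark's tooling.

```python
from collections import defaultdict


def accuracy_by_frames(records):
    """Group multiple-choice results by frame budget and return accuracy per budget.

    records: iterable of (num_frames, predicted, gold) tuples (assumed layout).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for num_frames, predicted, gold in records:
        totals[num_frames] += 1
        hits[num_frames] += int(predicted == gold)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


def few_frames_match(acc, tolerance=0.0):
    """True if the smallest frame budget matches or exceeds the largest one."""
    low, high = min(acc), max(acc)
    return acc[low] + tolerance >= acc[high]


records = [(8, "A", "A"), (8, "B", "A"), (256, "A", "A"), (256, "C", "C")]
acc = accuracy_by_frames(records)
```

A `True` result from `few_frames_match` on the full question set would be the kind of evidence that undermines the frame-capacity link; a `False` result alone does not establish causation.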
Original abstract
Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Albeit the progress, few public benchmark is available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as to accurately retrieve and reason over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, as part of the question, it contains a referring query that references related video contexts, called referred context. The model is then required to reason over relevant video details from the referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that the LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g. GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that model performance on the benchmark improves only when they are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.
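As an illustration of what "video-language interleaved input" can mean in practice, the sketch below merges sampled frame timestamps and subtitle lines into one time-ordered sequence. The tuple layout and the rule that a frame precedes a subtitle at the same timestamp are assumptions made for this example, not details confirmed by the abstract.

```python
def interleave(frame_times, subtitles):
    """Merge frames and subtitles into a single time-ordered sequence.

    frame_times: list of timestamps in seconds for the sampled frames.
    subtitles:   list of (start_seconds, text) pairs.
    Ties are broken by placing the frame before the subtitle (assumption).
    """
    items = [(t, 0, ("frame", t)) for t in frame_times]
    items += [(start, 1, ("subtitle", text)) for start, text in subtitles]
    items.sort(key=lambda item: (item[0], item[1]))
    return [payload for _, _, payload in items]


sequence = interleave([0.0, 10.0], [(5.0, "hello"), (10.0, "bye")])
```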
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LongVideoBench, a QA benchmark for long-context interleaved video-language understanding consisting of 3,763 web videos (up to 1 hour) with subtitles across diverse themes and 6,678 human-annotated multiple-choice questions in 17 categories. It defines a 'referring reasoning' task in which each question contains a referring query to a referred video context, requiring models to retrieve and reason over detailed multimodal information from long inputs. Evaluations on proprietary LMMs (GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo) and open-source models show substantial challenges and performance gaps, with results indicating that gains occur only for models able to process more frames.
Significance. If the annotations prove reliable and the task genuinely isolates long-context multimodal reasoning, LongVideoBench would be a valuable addition to the field as one of the largest public benchmarks targeting hour-scale video-language inputs. The human-annotated scale, thematic diversity, and explicit focus on retrieval-plus-reasoning over referred contexts provide a concrete testbed for future long-context LMMs. The reported performance ceilings on current frontier models already supply useful empirical signals.
major comments (1)
- [Abstract and Results] Abstract and Results section: the statement that 'model performance on the benchmark improves only when they are capable of processing more frames' rests on cross-model comparisons. These models differ simultaneously in scale, pre-training corpus, instruction tuning, and long-context adaptation; no within-model ablation that holds architecture and training fixed while varying only frame count or context length is described. The causal 'only when' phrasing therefore lacks direct support and risks confounding.
minor comments (2)
- [Benchmark Construction] Benchmark construction section: inter-annotator agreement statistics, question validation procedures, and explicit exclusion criteria for the 6,678 questions are not reported in detail. Adding these would strengthen the claim that the questions comprehensively require long-term multimodal understanding.
- [Task Definition] The paper positions 'referring reasoning' as a novel formulation, yet the distinction from prior referring-expression or long-video QA tasks could be made more explicit to clarify its incremental contribution.
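The report does not name a specific agreement statistic. One common choice for two annotators labelling the same items is Cohen's kappa, sketched below under that assumption; the function name and the toy labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


kappa = cohens_kappa(["A", "B", "A", "B"], ["A", "B", "B", "B"])
```

In this toy case the annotators agree on three of four items, chance agreement is 0.5, and kappa comes out to 0.5.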
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comment on the interpretation of our results. We address the major comment below and will make the corresponding revisions to the manuscript.
Referee: [Abstract and Results] Abstract and Results section: the statement that 'model performance on the benchmark improves only when they are capable of processing more frames' rests on cross-model comparisons. These models differ simultaneously in scale, pre-training corpus, instruction tuning, and long-context adaptation; no within-model ablation that holds architecture and training fixed while varying only frame count or context length is described. The causal 'only when' phrasing therefore lacks direct support and risks confounding.
Authors: We agree with the referee that the original phrasing in the abstract and results section implies a stronger causal relationship than is warranted by the cross-model comparisons presented. Our evaluations show that models with longer effective context windows (such as Gemini-1.5-Pro) achieve higher accuracy, while others plateau, but we acknowledge that these models also differ in scale, training data, and other factors. We will revise the abstract to replace the causal 'improves only when' with a more precise observational statement, e.g., 'we observe that performance on LongVideoBench is higher for models capable of processing more frames.' In the results section we will add explicit discussion of the limitations of cross-model analysis and note that within-model ablations varying only frame count or context length are left for future work. These changes will be incorporated in the revised manuscript.
Revision: yes
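A within-model ablation of the kind discussed in this exchange could be organised as in the following sketch: one fixed model is evaluated repeatedly while only the number of uniformly sampled frames changes. The function names, the midpoint sampling rule, the default budgets, and the `evaluate` callback are all assumptions for illustration and do not describe the authors' evaluation code.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick frame indices at the midpoints of num_samples equal-length segments."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    return [int((i + 0.5) * total_frames / num_samples) for i in range(num_samples)]


def frame_ablation(evaluate, total_frames, budgets=(8, 16, 32, 64)):
    """Run the same evaluation at several frame budgets.

    evaluate: hypothetical callback that takes a list of frame indices and
              returns a score for one fixed model; only the budget varies.
    """
    return {n: evaluate(uniform_frame_indices(total_frames, n)) for n in budgets}


indices = uniform_frame_indices(100, 4)
```

Because architecture and training stay fixed across budgets, any accuracy trend in the returned dictionary reflects frame count alone, which is the control the cross-model comparison lacks.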
Circularity Check
No circularity: benchmark is externally constructed and evaluated
The paper constructs LongVideoBench from web-collected videos and human-annotated questions in a referring-reasoning task. All reported results are empirical evaluations of independent external models (GPT-4o, Gemini-1.5-Pro, open-source LMMs) on this fixed benchmark. No equations, fitted parameters, self-citations, or derivations are present that reduce any claim to the paper's own inputs by construction. The statement that performance improves only with greater frame capacity is an observational finding from cross-model comparisons, not a self-referential prediction or renamed input.
Axiom & Free-Parameter Ledger
invented entities (1)
- referring reasoning task: no independent evidence
Forward citations
Cited by 18 Pith papers
- Mosaic: Cross-Modal Clustering for Efficient Video Understanding
  Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
- Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
  Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
- MLVU: Benchmarking Multi-task Long Video Understanding
  MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
  MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
  MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
- QoS-QoE Translation with Large Language Model
  A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.
- ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
  ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
- Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
  OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
  VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
  LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
- TTF: Temporal Token Fusion for Efficient Video-Language Model
  TTF fuses temporally redundant visual tokens via local similarity search in a plug-and-play way, cutting ~67% tokens on Qwen3-VL-8B while retaining 99.5% accuracy with minimal overhead.
- Kimi K2.5: Visual Agentic Intelligence
  Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
  InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...
- EasyVideoR1: Easier RL for Video Understanding
  EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.