LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Bei Chen; Dongxu Li; Haoning Wu; Junnan Li

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

LongVideoBench tests long-context video understanding with referring reasoning on videos up to an hour long.

2026-05-18 00:25 UTC pith:NP77ZS2J

Load-bearing objection: LongVideoBench adds a practical new benchmark for hour-scale video QA with a referring reasoning task, though the claim tying gains strictly to frame capacity rests on cross-model comparisons without tight controls.

arXiv 2407.15754 v1 · pith:NP77ZS2J · submitted 2024-07-22 · cs.CV cs.CL cs.LG

Haoning Wu, Dongxu Li, Bei Chen, Junnan Li

classification cs.CV cs.CL cs.LG

keywords long video understanding, multimodal benchmark, referring reasoning, large multimodal models, video question answering, long-context evaluation, interleaved video language

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates LongVideoBench to fill the gap in public tests for large multimodal models handling long and rich inputs. It collects thousands of web videos with subtitles and designs questions that require retrieving and reasoning over specific details from referred parts of the video. Human annotators produce over six thousand multiple-choice items across seventeen categories. Tests show that even leading proprietary models like GPT-4o and Gemini-1.5-Pro fall short, open-source models lag further behind, and scores rise only for systems that can take in more frames.

Core claim

LongVideoBench supplies 3,763 videos and 6,678 questions that frame the core problem as accurate retrieval and reasoning over detailed multimodal information from long interleaved inputs, using a referring-reasoning task in which each question points to a referred context that the model must then analyze.

What carries the argument

Referring reasoning, the task in which a question contains a referring query that points to related video contexts called the referred context, forcing the model to locate and reason over the relevant details from that context.
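As an illustration of the task shape described above, a single benchmark item can be pictured as a small record holding the referring query, the question, and the answer options. This is a minimal sketch, not the benchmark's actual schema: the class name, every field name, and the example item (including its options and correct answer) are assumptions made for this page.

```python
from dataclasses import dataclass
from typing import List

LETTERS = "ABCDEFGH"


@dataclass
class ReferringQuestion:
    """One multiple-choice item; every field name here is hypothetical."""
    video_id: str
    referring_query: str   # points at the referred context inside the video
    question: str          # what to reason about within that context
    options: List[str]     # candidate answers
    answer_index: int      # position of the correct option in `options`

    def prompt(self) -> str:
        """Render the referring query, question, and lettered options as text."""
        lines = [f"{self.referring_query} {self.question}".strip()]
        lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(self.options)]
        return "\n".join(lines)

    def is_correct(self, predicted_letter: str) -> bool:
        """True only if the reply is a single valid letter naming the right option."""
        letter = predicted_letter.strip().upper()
        if len(letter) != 1 or letter not in LETTERS[: len(self.options)]:
            return False
        return LETTERS.index(letter) == self.answer_index


# Illustrative item; the options and the answer are invented for this sketch.
example = ReferringQuestion(
    video_id="placeholder-video",
    referring_query="When the video transitions to the office,",
    question="what are the employees doing?",
    options=["Holding a meeting", "Eating lunch", "Leaving the building"],
    answer_index=0,
)
```

Keeping the referring query separate from the question makes it possible to check whether a model located the referred context before grading its reasoning, though whether the released data separates the two this way is not stated here.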

Load-bearing premise

The human-annotated questions and video selection process accurately capture long-term multimodal understanding without significant curation biases or gaps in coverage of real-world scenarios.

What would settle it

A model that processes only a small number of frames yet matches or exceeds the accuracy of models that ingest many more frames on the full set of 6,678 questions would undermine the reported link between frame capacity and benchmark performance.

If this is right

Proprietary models such as GPT-4o, Gemini-1.5-Pro and GPT-4-Turbo still encounter substantial difficulties on hour-long video inputs.
Open-source models display an even wider performance gap than their proprietary counterparts.
Benchmark scores rise measurably only when models gain the ability to process additional frames.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could use the benchmark to measure progress toward systems that retain fine detail across extended video sequences without proportional increases in compute.
The design may encourage new architectures that better fuse subtitle text with visual content over long time spans.
Similar referring-reasoning formats could be adapted to test long-context understanding in other modalities such as audio or document streams.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

LongVideoBench adds a practical new benchmark for hour-scale video QA with a referring reasoning task, though the claim tying gains strictly to frame capacity rests on cross-model comparisons without tight controls.

The main takeaway is that this paper builds LongVideoBench to evaluate long-context multimodal understanding on videos up to an hour long, using web-sourced clips with subtitles and a referring reasoning task that requires models to link a query to specific video contexts before reasoning over details. The authors end up with 3,763 videos and 6,678 human-annotated multiple-choice questions spread across 17 categories. That scale and task framing go beyond most existing shorter video benchmarks and give a clearer picture of where current models fall short on extended inputs. Top proprietary systems like GPT-4o and Gemini-1.5-Pro already struggle, and open-source ones lag further, which lines up with the need for better long-context handling. The results also flag that performance tracks with the ability to process more frames, which is a useful directional signal for people scaling these models.

The construction itself looks careful: external videos, no self-referential loops, and a focus on retrieval plus reasoning rather than just surface matching. That part holds up as a genuine addition.

The softer spot is the interpretation around frame capacity. The comparisons run across models that differ in size, pretraining, and tuning at the same time, so isolating frame count as the decisive factor would need a within-model ablation that holds architecture fixed. Without that, the “only when” phrasing is not fully secured by the evidence shown. Annotation details like agreement rates and validation steps are also light in the abstract, though the full paper may fill them in.

This is aimed at researchers building or testing long-context LMMs and video understanding systems. Anyone running evaluations on extended multimodal inputs will find the category split and the performance gaps directly usable for measuring progress. It has enough substance and addresses a real gap, so it deserves a serious referee even if some result wording needs tightening in revision. I would send it to peer review.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces LongVideoBench, a QA benchmark for long-context interleaved video-language understanding consisting of 3,763 web videos (up to 1 hour) with subtitles across diverse themes and 6,678 human-annotated multiple-choice questions in 17 categories. It defines a 'referring reasoning' task in which each question contains a referring query to a referred video context, requiring models to retrieve and reason over detailed multimodal information from long inputs. Evaluations on proprietary LMMs (GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo) and open-source models show substantial challenges and performance gaps, with results indicating that gains occur only for models able to process more frames.

Significance. If the annotations prove reliable and the task genuinely isolates long-context multimodal reasoning, LongVideoBench would be a valuable addition to the field as one of the largest public benchmarks targeting hour-scale video-language inputs. The human-annotated scale, thematic diversity, and explicit focus on retrieval-plus-reasoning over referred contexts provide a concrete testbed for future long-context LMMs. The reported performance ceilings on current frontier models already supply useful empirical signals.

major comments (1)

[Abstract and Results] The statement that 'model performance on the benchmark improves only when they are capable of processing more frames' rests on cross-model comparisons. These models differ simultaneously in scale, pre-training corpus, instruction tuning, and long-context adaptation; no within-model ablation that holds architecture and training fixed while varying only frame count or context length is described. The causal 'only when' phrasing therefore lacks direct support and risks confounding.
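The within-model ablation requested here could look roughly like the sketch below: one fixed model is scored repeatedly while only the number of uniformly sampled frames changes. This is an assumption-laden illustration, not the authors' protocol. The function names, the item dictionary keys (`total_frames`, `answer_index`, `evidence`), and the `answer_fn` wrapper are all hypothetical, and `toy_answer_fn` is a stand-in that only "sees" the answer when a sampled frame lands inside a marked evidence window.

```python
from typing import Callable, Dict, List, Sequence


def uniform_frame_indices(total_frames: int, budget: int) -> List[int]:
    """Pick up to `budget` frame indices spread evenly (segment midpoints)."""
    if total_frames <= 0 or budget <= 0:
        return []
    budget = min(budget, total_frames)
    step = total_frames / budget
    return [int(step * (i + 0.5)) for i in range(budget)]


def frame_budget_ablation(
    answer_fn: Callable[[Sequence[int], dict], int],
    items: Sequence[dict],
    budgets: Sequence[int],
) -> Dict[int, float]:
    """Accuracy per frame budget with the model held fixed.

    `answer_fn(frame_indices, item)` is a hypothetical wrapper around a single
    model checkpoint; it returns the index of the option the model chose.
    """
    results: Dict[int, float] = {}
    for budget in budgets:
        correct = 0
        for item in items:
            frames = uniform_frame_indices(item["total_frames"], budget)
            if answer_fn(frames, item) == item["answer_index"]:
                correct += 1
        results[budget] = correct / len(items) if items else 0.0
    return results


def toy_answer_fn(frame_indices: Sequence[int], item: dict) -> int:
    """Toy stand-in for a model: right only if a sampled frame hits the evidence."""
    start, end = item["evidence"]  # hypothetical inclusive frame range
    saw_evidence = any(start <= f <= end for f in frame_indices)
    return item["answer_index"] if saw_evidence else -1
```

Because architecture, weights, and prompts stay constant across budgets, any accuracy change in such a run can be attributed to frame count alone, which is the control the cross-model comparison lacks.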

minor comments (2)

[Benchmark Construction] Inter-annotator agreement statistics, question validation procedures, and explicit exclusion criteria for the 6,678 questions are not reported in detail. Adding these would strengthen the claim that the questions comprehensively require long-term multimodal understanding.
[Task Definition] The paper positions 'referring reasoning' as a novel formulation, yet the distinction from prior referring-expression or long-video QA tasks could be made more explicit to clarify its incremental contribution.
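For the agreement statistics requested in the first minor comment, one common choice is Cohen's kappa between two annotators who answer the same questions. The review does not say which statistic the authors would use, so the sketch below is only an assumption about what such a report might compute; the function name and inputs are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate
    and p_e is the agreement expected by chance from each annotator's own
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    shared = freq_a.keys() & freq_b.keys()
    expected = sum(freq_a[k] * freq_b[k] for k in shared) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates agreement well above chance and a value near 0 indicates agreement no better than chance. With more than two annotators per question, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.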

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive comment on the interpretation of our results. We address the major comment below and will make the corresponding revisions to the manuscript.

Referee: [Abstract and Results] The statement that 'model performance on the benchmark improves only when they are capable of processing more frames' rests on cross-model comparisons. These models differ simultaneously in scale, pre-training corpus, instruction tuning, and long-context adaptation; no within-model ablation that holds architecture and training fixed while varying only frame count or context length is described. The causal 'only when' phrasing therefore lacks direct support and risks confounding.

Authors: We agree with the referee that the original phrasing in the abstract and results section implies a stronger causal relationship than is warranted by the cross-model comparisons presented. Our evaluations show that models with longer effective context windows (such as Gemini-1.5-Pro) achieve higher accuracy, while others plateau, but we acknowledge that these models also differ in scale, training data, and other factors. We will revise the abstract to replace the causal 'improves only when' with a more precise observational statement, e.g., 'we observe that performance on LongVideoBench is higher for models capable of processing more frames.' In the results section we will add explicit discussion of the limitations of cross-model analysis and note that within-model ablations varying only frame count or context length are left for future work. These changes will be incorporated in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark is externally constructed and evaluated

The paper constructs LongVideoBench from web-collected videos and human-annotated questions in a referring-reasoning task. All reported results are empirical evaluations of independent external models (GPT-4o, Gemini-1.5-Pro, open-source LMMs) on this fixed benchmark. No equations, fitted parameters, self-citations, or derivations are present that reduce any claim to the paper's own inputs by construction. The statement that performance improves only with greater frame capacity is an observational finding from cross-model comparisons, not a self-referential prediction or renamed input.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

This is an empirical benchmark paper with no mathematical derivations, so it introduces no free parameters, standard axioms, or invented physical entities. The primary addition is the curated dataset and task definition.

invented entities (1)

referring reasoning task · no independent evidence
purpose: To require models to retrieve and reason over specific multimodal details referenced in the question from long video inputs
This is a formulated evaluation task introduced for the benchmark.

pith-pipeline@v0.9.0 · 5815 in / 1167 out tokens · 35647 ms · 2026-05-18T00:25:26.172972+00:00 · methodology

Original abstract

Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Albeit the progress, few public benchmark is available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as to accurately retrieve and reason over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, as part of the question, it contains a referring query that references related video contexts, called referred context. The model is then required to reason over relevant video details from the referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that the LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g. GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that model performance on the benchmark improves only when they are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HumanMoveVQA: Can Video MLLMs reason about human movement in videos?
cs.CV 2026-06 unverdicted novelty 7.0

HumanMoveVQA is a new benchmark that generates 10K+ QA pairs from 3D-lifted video tracks to evaluate video MLLMs on global human trajectory and orientation reasoning.
An Efficient Streaming Video Understanding Framework with Agentic Control
cs.CV 2026-05 unverdicted novelty 7.0

R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
cs.PF 2026-04 unverdicted novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
cs.CV 2025-01 unverdicted novelty 7.0

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
LVBench: An Extreme Long Video Understanding Benchmark
cs.CV 2024-06 accept novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
MedStreamBench: A Time-Aware Benchmark for Streaming and Proactive Medical Video Understanding
cs.CV 2026-07 unverdicted novelty 6.0

MedStreamBench integrates 22 medical datasets into 5,419 QA instances across retrospective, present, future, and proactive temporal settings to evaluate streaming and proactive medical video understanding.
From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
cs.CV 2026-06 unverdicted novelty 6.0

Audit of four VideoQA benchmarks reveals text-only shortcuts in VLMs; new diagnostics Blind Gap, Visual Gain, and Shortcut Score quantify and filter visual dependence.
HumanMoveVQA: Can Video MLLMs reason about human movement in videos?
cs.CV 2026-06 unverdicted novelty 6.0

HumanMoveVQA is a benchmark using 3D-lifted video tracks to evaluate video MLLMs on seven categories of global human motion reasoning, showing gaps in proprietary models but gains from fine-tuning.
HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning
cs.CV 2026-06 unverdicted novelty 6.0

HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
Harnessing Streaming Video in the Wild
cs.CV 2026-06 unverdicted novelty 6.0

Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.
MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
cs.CV 2026-06 unverdicted novelty 6.0

MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
PEEK: Picking Essential frames via Efficient Knowledge distillation
cs.CV 2026-05 unverdicted novelty 6.0

PEEK distills caption-conditioned frame relevance into a lightweight visual model, outperforming adaptive baselines on ActivityNet Captions and MSR-VTT especially at 1-2 frame budgets while adding only 5.2% overhead.
TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation
cs.CL 2026-05 unverdicted novelty 6.0

TeachObs is a new human-validated benchmark dataset and evaluation protocol for multimodal AI on classroom teaching observation, showing no model dominates across tracks and that models over-rate procedurally clear lessons.
An Efficient Streaming Video Understanding Framework with Agentic Control
cs.CV 2026-05 unverdicted novelty 6.0

R3-Streaming uses cascaded control, age-aware memory forgetting, and TB-GRPO reinforcement learning to reach SOTA scores on streaming video benchmarks while cutting visual token usage by 95-96%.
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0

MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0

MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
QoS-QoE Translation with Large Language Model
cs.MM 2026-04 unverdicted novelty 6.0

A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
cs.CV 2026-03 unverdicted novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
cs.CV 2025-12 unverdicted novelty 6.0

OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
cs.CV 2025-05 unverdicted novelty 6.0

LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
cs.CV 2025-01 unverdicted novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
cs.CV 2024-12 unverdicted novelty 6.0

VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
LLaVA-Video: Video Instruction Tuning With Synthetic Data
cs.CV 2024-10 unverdicted novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
cs.CV 2024-08 unverdicted novelty 6.0

LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
EFlow: Learning Evidence Flow for Long-Video Reasoning with Adaptive Reflection
cs.CV 2026-07 unverdicted novelty 5.0

EFlow separates temporal grounding from logical reasoning via two CoT stages and adds confidence-aware reflection, trained via SFT and RL on custom trajectory data, yielding gains on five video benchmarks.
TuringViT: Making SOTA Vision Transformers Accessible to All
cs.CV 2026-06 unverdicted novelty 5.0

TuringViT uses Turing Linear Attention, VISTA-Curation, and dynamic-resolution pretraining to outperform open ViT baselines with 10% data while improving VLM performance and high-resolution latency.
TuringViT: Making SOTA Vision Transformers Accessible to All
cs.CV 2026-06 unverdicted novelty 5.0

TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.
VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

VisionPulse is a step-wise visual token pruning method for LMMs that retains 5% of tokens per step, shortens reasoning traces by 11.2%, and maintains accuracy.
EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs
cs.CV 2026-05 unverdicted novelty 5.0

EgoCoT-Bench provides 3,172 verifiable QA pairs across perception, anticipation, and reasoning tasks on egocentric videos, revealing that many MLLMs give answer-correct but evidence-inconsistent explanations.
TTF: Temporal Token Fusion for Efficient Video-Language Model
cs.CV 2026-05 unverdicted novelty 5.0

TTF fuses temporally redundant visual tokens via local similarity search in a plug-and-play way, cutting ~67% tokens on Qwen3-VL-8B while retaining 99.5% accuracy with minimal overhead.
Kimi K2.5: Visual Agentic Intelligence
cs.CL 2026-02 unverdicted novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
Qwen2.5-VL Technical Report
cs.CV 2025-02 unverdicted novelty 5.0

Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level lo...
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
cs.CV 2025-01 unverdicted novelty 5.0

InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
cs.CV 2024-08 unverdicted novelty 5.0

mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
cs.CV 2026-06 unverdicted novelty 4.0

InternVideo3 introduces Multimodal Contextual Reasoning and M^2LA attention to enable closed-loop evidence accumulation in long-video understanding and agentic tool use, reporting strong benchmark results.
EasyVideoR1: Easier RL for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 37 Pith papers · 1 internal anchor

[1]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

URL https://huggingface.co/blog/idefics. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] See Sec. C (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Sec. D (d) Have you read the ethics review guidelines and ensur...

work page
[3]

(a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

work page
[4]

for benchmarks)

If you ran experiments (e.g. for benchmarks)... (a) Did you include the code, data, and instructions needed to reproduce the main experi- mental results (either in the supplemental material or as a URL)? [Yes] All code, data and instructions can be assessed at https://longvideobench.github.io. (b) Did you specify all the training details (e.g., data split...

work page
[5]

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [Yes] (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] All assets can be assessed at https://longvide...

work page
[6]

type": "image_url

If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] The instructions are included separately in Sec. E.1. (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applica...

work page 2024
[7]

Find an action or event

work page
[8]

Pause, describe/outline the scene information as the question stem

work page
[9]

Use this action or event as the answer

work page
[10]

S CENE -REFERRED OBJECT (S2O)

You may refer to these examples: • What is the boy in the video doing at Danube Square? • What happens after all the ingredients are placed in the pot? • When the video transitions to the office, what are the employees doing? • What are the characters in the video doing in the café? (non-knowledge videos) • What was George Washington doing under the apple...

work page
[12]

Describe/outline the scene information as the question stem

work page
[13]

Use the appearing people/objects and the absent ones as correct and incorrect answers respectively

work page
[14]

S CENE -REFERRED OBJECT ATTRIBUTE (S2A)

You may refer to these examples: • What objects appeared in Laura’s bedroom in the video? (Lifestyle) • When all the ingredients are chopped and placed together, which ingredient did not appear? (Cooking) • Which communication method was not mentioned in the fourth section? (Physics) • Which character did not appear at the duel in the movie? • Does the me...

work page
[15]

Find a scene, observe the people or objects in this scene

work page
[16]

Describe/outline the scene information and determine an object as the question stem

work page
[17]

Use existing and non-existing attributes of the object as correct and incorrect answers respectively, such as material, color, shape, transparency, surface characteristics, structural features

work page
[18]

EVENT -REFERRED OBJECT (E2O)

You may refer to these examples: • What clothes is Laura wearing in the bedroom with an air conditioner, a bed, and a clothes rack? • What color is used to represent the feed forward layer in the Transformer network in Figure 4? • Is the person in red clothing wearing glasses in the square with a fountain during the day? • What color horse did Napoleon ri...

work page
[19]

Find an action or event. 18

work page
[20]

Identify the participating people or objects

work page
[21]

Describe this action/event as the question stem

work page
[22]

The options should also be as detailed as possible

Based on the subtitles at the time of the action/event or other background information, detail the participating people/objects as the answer. The options should also be as detailed as possible

work page
[23]

You may refer to these examples:
• Who participated in and won the duel in the movie?
• Which character finished knitting the sweater?
• What object exploded in the chemistry experiment in the video?
• What is the expression of the input variable passed into the Transformer in the video?

V. OBJECT-REFERRED EVENT (O2E)

Find a person or object.

Identify the actions/events that happen at their appearance.

Describe the person/object as the question stem.

Based on a scene where this person/object appears (e.g., first appearance), ask what event happened or what action they took at that time.

You may refer to these examples:
• What did the girl in red do the first time she appeared?
• What happened the first time a volcano appeared in the video?

VI. TEXT-REFERRED EVENT (T2E)

Identify the action in the current frame of the video.

Think of a few actions that did not appear in the video but are easily confused.

Use the action from step 2 as the correct answer, and the actions from step 3 as other options.

You may refer to these examples:
• What was the protagonist doing when mentioning the Renaissance?
• What event happened when “bidirectional encoder” first appeared in the subtitles?

VII. TEXT-REFERRED OBJECT (T2O)

Identify a certain object in the frame; for example, a black water bottle.

Think of a few objects that did not appear in the video but are easily confused, such as a red water bottle, a black hat, a water dispenser, a transparent water cup.

Use the object from step 2 as the correct answer, and the objects from step 3 as other options.

You may refer to these examples:
• What object was present when the lecturer mentioned “revolutionary changes”?
• Which object did not appear when talking about Jack and Rose having a heart-to-heart conversation?

VIII. TEXT-REFERRED OBJECT ATTRIBUTE (T2A)

Find a segment of subtitles, pause the video.

Identify a certain object in the frame.

Identify an attribute of the object, such as material, color, shape, transparency, surface characteristics, structural features.

Use the object from step 2 as the correct answer, and the attributes from step 3 as other options.

You may refer to these examples:
• What was Tesla’s hairstyle like when he was mentioned to have invented alternating current?
• What color hat was the female protagonist wearing when talking about taking a break?

Instructions for (L2) Relation questions. The specific instructions for each category of (L2) questions are as follows. These questions require...

Find two or more adjacent actions or events.

Describe one of the actions/events as the question stem, and the other as the correct answer.

You may refer to these examples:
• What did Clara do before taking a photo? (applicable to movie or lifestyle videos)
• What needs to be done after installing the screws? (applicable to guide videos)
• Which of the following historical/geographical events was mentioned first? (applicable to history/geography videos)
• What did the protagonist do before pl...

OBJECT BEFORE/AFTER OBJECT (O3O)

Find two or more people/objects/concepts that appear in the video.

Describe one of the objects as the question stem, and the other as the correct answer.

You may refer to these examples:
• After Jack appears, which character appears first in this movie?
• Which concept is introduced first in the video after entropy is introduced?

XI. SEQUENCE OF SCENES (SSS)

Find multiple scenes (at least three) in the video.

Ask questions about the order of these scenes.

Answer with the correct sequence and use a few scrambled sequences as distractors.
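
The sequence-of-scenes steps above can be sketched in Python. This is only an illustration of the "correct sequence plus scrambled distractors" idea, not the benchmark's own tooling: the function name `build_sequence_question`, its parameters, and the returned dictionary keys are all hypothetical.

```python
import itertools
import random


def build_sequence_question(scenes, num_distractors=3, seed=0):
    """Build one multiple-choice item from scenes given in their true order.

    Hypothetical helper: the names and the output layout are assumptions
    made for this sketch only.
    """
    if len(scenes) < 3:
        raise ValueError("the guideline asks for at least three scenes")
    correct = tuple(scenes)
    # Every distinct ordering other than the true one is a candidate distractor.
    # Enumerating permutations is only practical for a handful of scenes.
    scrambled = sorted(set(itertools.permutations(scenes)) - {correct})
    rng = random.Random(seed)
    distractors = rng.sample(scrambled, min(num_distractors, len(scrambled)))
    options = [correct] + distractors
    rng.shuffle(options)  # hide the position of the correct answer
    return {
        "question": "Which of the following scene sequences is correct?",
        "options": [" -> ".join(option) for option in options],
        "answer_index": options.index(correct),
    }


item = build_sequence_question(
    ["experiment video", "slides with text", "closing summary"]
)
```

Fixing the seed keeps the option order reproducible across runs, which may help when a second annotator reviews the same item.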

You may refer to this example:
• Which of the following scene sequences is correct?
  – A. First, a segment of the experiment video is played, then slides with text are shown, and finally XXXX.
  – B. First, slides with text are shown, ...

XII. SCENE-REFERRED OBJECT TRACKING (SOS)

Then ask in which other scenes they appeared.

Distractors are scenes where this object did not appear.

You may refer to these examples:
• In which of the following places did the boy who was running at the beginning of the video appear?
  – A. Square on a sunny day,
  – B. On a boat at sea,
  – C. In a bar on a rainy day, ...
• In which other scenes did the protagonist’s lightsaber, used in the opening fight, appear?

XIII. SCENE-REFERRED OBJECT ATTRIBUTE CHANG...

Find a specific person/object/concept that appears in multiple scenes.

Then describe another scene and ask what attribute of this person/object/concept has changed at that time.

You may refer to these examples:
• What did the boy running at the beginning of the video change into when climbing the mountain at the end?
  – A. Changed from a white T-shirt to a black vest
  – B. Changed from red shoes to white shoes
  – C. ...
• What changed in the color of the onions initially poured into the pot?
• What new part did the sapling planted i...

Find a segment of subtitles, and an action/event in the video that happens before/after it.

Rephrase/outline the subtitle as the given information and design the question stem, with the action/event as the correct answer.

Distractors are other actions/events in the video that do not meet the sequence relationship in the question stem.
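
The before/after rule in the steps above can be illustrated with a small Python sketch. It assumes every event carries a timestamp in seconds; the `Event` class, the `split_by_relation` function, and the choice of the event nearest to the subtitle as the correct answer are all assumptions of this sketch, not something the guidelines prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    description: str
    time_s: float  # hypothetical field: when the event happens, in seconds


def split_by_relation(events, subtitle_time_s, relation):
    """Return (correct_event, distractors) for a before/after-subtitle question.

    Events that satisfy the relation are answer candidates; events that do
    not satisfy it become distractors, as in the guideline.
    """
    if relation == "after":
        candidates = [e for e in events if e.time_s > subtitle_time_s]
    elif relation == "before":
        candidates = [e for e in events if e.time_s < subtitle_time_s]
    else:
        raise ValueError("relation must be 'before' or 'after'")
    if not candidates:
        raise ValueError("no event satisfies the requested relation")
    distractors = [e for e in events if e not in candidates]
    # Assumption: take the candidate closest in time to the subtitle, and
    # leave the other candidates out so that no second option is also true.
    correct = min(candidates, key=lambda e: abs(e.time_s - subtitle_time_s))
    return correct, distractors


events = [
    Event("washes the apple", 10.0),
    Event("eats the apple", 42.0),
    Event("leaves the room", 90.0),
]
correct, distractors = split_by_relation(events, subtitle_time_s=40.0, relation="after")
```

Dropping the remaining candidates from the option list is a design choice that avoids items with two defensible answers.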

You may refer to these examples:
• What did Clara do after she said, “I eat an apple every day”?
• What happened before the narrator mentioned the experiment starting?
• What action was performed after the chef said, “Now wait until the steak surface turns golden”?

XV. OBJECT BEFORE/AFTER TEXT (T3O)

Find the scene where a specific person/object first appears.

Then find subtitles before or after this timeframe, rephrase/outline the subtitle as the given information and design the question stem, with the object/person as the correct answer.

Distractors are other people/objects in the video that do not meet the sequence relationship in the question stem.

You may refer to these examples:
• Which characters appeared after the commentary mentioned “100 years later”?
• Which animal appeared on screen before mentioning “dietary habits of North American squirrels”?

XVI. TEXT-REFERRED OBJECT TRACKING (TOS)

Find a specific person/object/concept that appeared at least once along with subtitles.

Ask about a subtitle shown at the object’s appearance.

Distractors are subtitles where this object did not appear at the corresponding moment.

You may refer to these examples:
• With which subtitles did the boy running at the beginning of the video appear?
• During which of the following dialogues did the protagonist’s lightsaber, used in the opening fight, appear on screen?

XVII. TEXT-REFERRED OBJECT ATTRIBUTE CHANGE (TAA)

Find a specific person/object/concept that appeared at least once along with subtitles.

Define this person/object/concept by their action/attribute in one of the scenes.

Ask what attribute has changed when XX text is mentioned.

Figure 6: The annotation interface for LongVideoBench.

You may refer to these examples:
• What change occurred to the girl in the blue jacket and black hood in the middle of the video when mentioning “I am going to sleep”?
  – A. She changed the color of her hood
  – B. She changed into a black jacket
  – C. She took off her hood
  – D. She took off her jacket

Annotation Interface. The annotation interface of LONG VI...

Participate in our mandatory training to understand the guidelines of annotation.

Watch videos, and provide annotations on these videos. Each annotation includes the following terms: (a) A question; (b) One or more timestamp(s) on the question; (c) Four to five options; (d) A checkbox to pick the correct option.

Check the correctness of annotations from other annotators.
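
The four annotation fields listed above (question, timestamps, options, correct option) can be captured in a small record with a checker, which may help when reviewing other annotators' work. This is a minimal sketch: the `Annotation` class, its field names, and the `validate` function are hypothetical and do not describe the actual annotation interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    question: str
    timestamps_s: List[float]  # one or more timestamps tied to the question
    options: List[str]         # four to five answer options
    correct_index: int         # position of the ticked (correct) option


def validate(annotation: Annotation) -> List[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    if not annotation.question.strip():
        problems.append("question is empty")
    if len(annotation.timestamps_s) < 1:
        problems.append("at least one timestamp is required")
    if not 4 <= len(annotation.options) <= 5:
        problems.append("expected four to five options")
    if not 0 <= annotation.correct_index < len(annotation.options):
        problems.append("correct option is not among the options")
    return problems


good = Annotation(
    question="What did Clara do after she said, \"I eat an apple every day\"?",
    timestamps_s=[125.0],
    options=["Ate an apple", "Left the room", "Took a photo", "Made a call"],
    correct_index=0,
)
bad = Annotation(question="", timestamps_s=[], options=["A", "B"], correct_index=3)
```

Calling `validate(good)` returns an empty list, while `validate(bad)` reports one problem per violated field.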

Report videos that are not appropriate during the process.

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the indiv...

... “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw...


InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

URL https://huggingface.co/blog/idefics. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El ...


For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Sec. C
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Sec. D
(d) Have you read the ethics review guidelines and ensur...

If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]

If you ran experiments (e.g. for benchmarks)...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] All code, data and instructions can be accessed at https://longvideobench.github.io.
(b) Did you specify all the training details (e.g., data split...

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] All assets can be accessed at https://longvide...

If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] The instructions are included separately in Sec. E.1.
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applica...

Find an action or event.

Pause, describe/outline the scene information as the question stem.

Use this action or event as the answer.

You may refer to these examples:
• What is the boy in the video doing at Danube Square?
• What happens after all the ingredients are placed in the pot?
• When the video transitions to the office, what are the employees doing?
• What are the characters in the video doing in the café? (non-knowledge videos)
• What was George Washington doing under the apple...

SCENE-REFERRED OBJECT (S2O)

Describe/outline the scene information as the question stem.

Use the appearing people/objects and the absent ones as correct and incorrect answers respectively.
