arxiv: 2601.10649 · v2 · submitted 2026-01-15 · 💻 cs.CV

MINERVA-Cultural: A Benchmark for Cultural and Multilingual Long Video Reasoning

Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave

Pith reviewed 2026-05-16 13:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords: cultural video understanding · multilingual benchmarks · video large language models · long video reasoning · cultural bias in AI · human annotations · evidence graphs

The pith

A new benchmark shows state-of-the-art video language models fall far below human accuracy on multicultural and multilingual long-video reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MINERVA-Cultural, a benchmark built from human-annotated videos drawn from 18 distinct global locales, each with native-language questions, answers, and multi-step reasoning chains. Existing video-LLMs achieve substantially lower accuracy than humans on these tasks, and the dominant source of error is failure to correctly perceive culturally specific visual details. This gap arises because prior benchmarks rely on Western-centric data and automatic translations that do not capture situated cultural context. Establishing a more demanding test of this kind matters for any deployment of video models that must operate across languages and societies.

Core claim

MINERVA-Cultural shows that current Video-LLMs perform substantially below human-level accuracy on long videos that require recognition of culturally specific visual elements, with errors concentrated in visual perception rather than language or general reasoning. The benchmark supplies entirely human-generated native-language annotations and reasoning traces across 18 locales, and the authors convert those traces into evidence graphs that support an iterative procedure for isolating fine-grained reasoning mistakes.

What carries the argument

MINERVA-Cultural benchmark of human-generated multicultural video annotations paired with native-language questions and reasoning steps across 18 locales, together with the derived evidence graphs used for error localization.
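
To make the error-localization idea concrete, here is a minimal sketch of how a reasoning trace could be stored as a directed evidence graph and searched for the earliest failing step. It is not the authors' implementation: the `EvidenceNode` schema, the `first_failed_node` helper, the `"right"`/`"wrong"` status labels, and the toy graph contents are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional


@dataclass
class EvidenceNode:
    """One atomic step of a human reasoning trace (hypothetical schema)."""
    evidence: str
    timestamp: str = "N/A"
    parent_nodes: list[str] = field(default_factory=list)


def first_failed_node(graph: dict[str, EvidenceNode],
                      status: dict[str, str]) -> Optional[str]:
    """Return the earliest node, in dependency order, that is not marked 'right'.

    `status` maps node IDs to a label assigned by some external judge;
    nodes without a label are treated as not yet verified.
    """
    # TopologicalSorter expects {node: set_of_predecessors}.
    deps = {name: set(node.parent_nodes) for name, node in graph.items()}
    for name in TopologicalSorter(deps).static_order():
        if status.get(name) != "right":
            return name
    return None


# Toy graph, invented for illustration (not taken from the benchmark).
toy_graph = {
    "Node1": EvidenceNode("A dish is served at a street stall", "00:42"),
    "Node2": EvidenceNode("The dish is identified by name", "00:45", ["Node1"]),
    "Node3": EvidenceNode("Final answer: the name of the dish", "N/A", ["Node2"]),
}

toy_status = {"Node1": "right", "Node2": "wrong", "Node3": "wrong"}
print(first_failed_node(toy_graph, toy_status))
```

Walking the graph in topological order means a downstream node is only blamed once all of its parents have been confirmed, which is one plausible way to separate a root perception error from the reasoning errors it causes.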

If this is right

Video models must improve recognition of culturally specific visual elements to close the performance gap with humans.
Automatic translation pipelines are insufficient for creating reliable cultural-reasoning benchmarks.
Reasoning-trace graphs can be used to diagnose precise points of failure in long video chains.
Training data for video models needs explicit coverage of diverse regional cultural contexts rather than dominant-language sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark supplies a concrete target for fine-tuning or retrieval-augmented models aimed at cultural sensitivity.
Error graphs derived from native reasoning traces could be applied to other video tasks to surface hidden cultural biases.
Future extensions could test whether adding explicit cultural metadata to training videos reduces the observed perception failures.
Real-world systems that interpret video in non-English settings will likely need similar locale-specific evaluation to avoid systematic underperformance.

Load-bearing premise

The human-generated annotations across the 18 locales are free of annotator bias and require situated cultural knowledge beyond general visual or linguistic understanding.

What would settle it

A video-LLM that reaches human accuracy on the benchmark after training only on generic visual features with no additional cultural data would falsify the claim that cultural visual perception is the primary failure mode.

Original abstract

Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce MINERVA-Cultural, a challenging benchmark for multicultural and multilingual video reasoning. MINERVA-Cultural comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, MINERVA-Cultural provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on MINERVA-Cultural requires a deeply situated understanding of visual cultural context. Furthermore, we leverage MINERVA-Cultural's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. MINERVA-Cultural will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MINERVA-Cultural, a benchmark for multicultural and multilingual long video reasoning comprising entirely human-generated annotations from region-specific videos across 18 global locales. Questions, answers, and multi-step reasoning traces are provided in native languages rather than translations. The authors evaluate state-of-the-art Video-LLMs, report substantial performance gaps relative to human accuracy, and attribute the errors primarily to failures in visual perception of cultural elements. They further propose constructing evidence-based graphs from the reasoning traces and an iterative strategy to diagnose fine-grained reasoning errors.

Significance. If validated, the benchmark would meaningfully advance the field by mitigating western-centric and English-dominant biases in video understanding evaluation. The native-language, human-crafted reasoning traces and the graph-based error analysis method represent concrete strengths that could support more interpretable model development. The work's empirical focus on situated cultural context aligns with growing interest in culturally aware multimodal systems.

major comments (3)

[Abstract] Abstract and Evaluation section: the attribution of model errors 'primarily stemming from the visual perception of cultural elements' is load-bearing for the central claim yet rests on an unverified assumption. No ablation studies, cultural-knowledge variants, or comparisons showing that questions cannot be solved via generic visual or linguistic cues are reported, rendering the error-source conclusion circular without additional validation.
[Methods] Methods/Annotation Process: data selection criteria for the videos and questions across the 18 locales, including how representativeness was ensured and potential selection biases mitigated, are not specified. This information is required to assess whether the benchmark genuinely demands situated cultural knowledge rather than universal video understanding.
[Evaluation] Evaluation section: inter-annotator agreement statistics, number of annotators per locale, and their cultural expertise qualifications are absent. Without these, the human baseline and the claim of 'deeply situated understanding' lack the necessary grounding for the reported performance gaps to be interpreted reliably.

minor comments (2)

[Abstract] The GitHub link in the abstract should be verified for completeness and accompanied by a brief description of the released assets (videos, annotations, splits) to support reproducibility.
[Proposed Method] Notation for the evidence-based graphs and iterative strategy could be introduced more formally with a small diagram or pseudocode to clarify the pipeline for readers.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for their constructive and insightful review, which highlights both the potential significance of MINERVA-Cultural and areas where additional clarity would strengthen the manuscript. We address each major comment below and commit to revisions that improve transparency without altering the core contributions.

Point-by-point responses

Referee: [Abstract] Abstract and Evaluation section: the attribution of model errors 'primarily stemming from the visual perception of cultural elements' is load-bearing for the central claim yet rests on an unverified assumption. No ablation studies, cultural-knowledge variants, or comparisons showing that questions cannot be solved via generic visual or linguistic cues are reported, rendering the error-source conclusion circular without additional validation.

Authors: We thank the referee for this observation. The attribution in the current manuscript is based on our fine-grained error analysis using the evidence-based graphs derived from the human reasoning traces, which allowed us to categorize model failures into visual cultural perception errors versus other types. However, we acknowledge that this analysis would benefit from further validation. In the revision, we will expand the Evaluation section to provide more explicit details on the categorization methodology and tone down the abstract claim to reflect that it is supported by the graph-based diagnostics rather than exhaustive ablations. We will also note the absence of full ablation studies as a limitation.
revision: partial

Referee: [Methods] Methods/Annotation Process: data selection criteria for the videos and questions across the 18 locales, including how representativeness was ensured and potential selection biases mitigated, are not specified. This information is required to assess whether the benchmark genuinely demands situated cultural knowledge rather than universal video understanding.

Authors: We agree that explicit documentation of the selection process is necessary to substantiate the benchmark's focus on situated cultural knowledge. The original manuscript describes the videos as region-specific and culturally diverse but does not detail the criteria. We will revise the Methods section to include the video and question selection criteria, the process for ensuring geographic and cultural representativeness across the 18 locales, and the steps taken to reduce selection bias.
revision: yes

Referee: [Evaluation] Evaluation section: inter-annotator agreement statistics, number of annotators per locale, and their cultural expertise qualifications are absent. Without these, the human baseline and the claim of 'deeply situated understanding' lack the necessary grounding for the reported performance gaps to be interpreted reliably.

Authors: We appreciate this point regarding the grounding of the human baseline. While the annotations were performed by native speakers with relevant cultural expertise, the manuscript does not report the quantitative details. In the revised version, we will add inter-annotator agreement statistics, the number of annotators per locale, and a description of their cultural expertise qualifications to the Evaluation section.
revision: yes
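
For readers unfamiliar with the agreement statistics discussed in the last exchange, the sketch below computes Cohen's kappa for two annotators who label the same items. Neither the paper nor the rebuttal says which statistic would be reported, so the choice of kappa, the function name `cohens_kappa`, and the toy labels are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable],
                 labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration (not benchmark data).
annotator_1 = ["yes", "yes", "no", "maybe"]
annotator_2 = ["yes", "yes", "no", "no"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Kappa only covers the two-annotator, categorical case; with more annotators per locale or free-text answers, a different statistic would be needed.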

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark creation and evaluation

The paper introduces MINERVA-Cultural as a new human-annotated benchmark for multicultural video reasoning and reports empirical model evaluations against human baselines. No derivation chain, equations, fitted parameters, or predictions exist that reduce outputs to inputs by construction. Error analysis via reasoning traces and evidence graphs is downstream of the benchmark data rather than self-defining or self-referential. The work is self-contained against external human performance metrics with no load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that native human annotators can reliably produce culturally accurate questions and reasoning steps that isolate visual cultural perception as the key failure mode.

axioms (1)

Domain assumption: Human annotations from region-specific experts constitute reliable ground truth for cultural video understanding.
The benchmark treats these annotations as the reference standard against which model performance is measured.

pith-pipeline@v0.9.0 · 5540 in / 1355 out tokens · 49134 ms · 2026-05-16T13:35:52.945005+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We leverage CURVE’s reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning... 75% of all failures can be attributed to cultural visual perception.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
