JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

Jianghan Chao; Jianzhang Gao; Liyun Ru; Ruihua Song; Wenhui Tan; Yuchong Sun

arxiv: 2512.12772 · v2 · pith:GQ37DA3Bnew · submitted 2025-12-14 · 💻 cs.MM · cs.CV

Jianghan Chao, Jianzhang Gao, Wenhui Tan, Yuchong Sun, Ruihua Song, Liyun Ru

Pith reviewed 2026-05-16 22:29 UTC · model grok-4.3

keywords: JointAVBench · Omni-LLMs · audio-visual reasoning · joint multi-modal evaluation · video benchmark · cross-scene reasoning · automated dataset synthesis

The pith

Even the best Omni-LLMs reach only 65.3 percent average accuracy on a benchmark that demands strict joint audio-visual reasoning in videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JointAVBench to test models on video questions that cannot be solved from vision or audio alone. It spans five cognitive dimensions, four audio types such as speech and music, and three scene spans from single to cross-scene. An automated pipeline builds the questions and answers using vision-LLMs, audio-LLMs, and general LLMs. When leading models are tested, the top Omni-LLM scores 65.3 percent on average, beating uni-modal baselines but showing clear shortfalls especially when reasoning must cross scenes.

Core claim

JointAVBench is a new benchmark built around strict audio-video correlation and designed to evaluate Omni-LLMs across five cognitive dimensions, four audio information types, and three scene spans. An automated synthesis pipeline produces questions and answers that require joint understanding. Evaluation results show that the strongest Omni-LLM attains an average accuracy of 65.3 percent, outperforming vision-only and audio-only baselines while leaving substantial headroom for improvement, particularly on cross-scene tasks.
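
To make the taxonomy concrete, a single benchmark item could be represented roughly as follows. This is a hypothetical schema, not the authors' data format: the names `BenchmarkItem`, `AudioType`, `SceneSpan`, and all field names are assumptions made for illustration. Only the four audio types and three scene spans come from the paper; the cognitive dimension is kept as a free-form string because the five dimensions are not enumerated in this summary.

```python
from dataclasses import dataclass
from enum import Enum


class AudioType(Enum):
    # The four audio information types named in the paper.
    SPEECH = "speech"
    SOUND_EVENT = "sound event"
    MUSIC = "music"
    VOCAL_TRAITS = "vocal traits"


class SceneSpan(Enum):
    # The three scene spans named in the paper.
    SINGLE = "single-scene"
    CROSS = "cross-scene"
    FULL = "full-scene"


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: tuple[str, ...]
    answer_index: int
    audio_type: AudioType
    scene_span: SceneSpan
    cognitive_dimension: str  # assumption: free-form label, not enumerated here

    def is_correct(self, predicted_index: int) -> bool:
        # A prediction counts as correct only if it picks the keyed option.
        return predicted_index == self.answer_index
```

Keeping audio type and scene span as enums means a per-category breakdown can group on them directly without string matching.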

What carries the argument

Automated synthesis pipeline that combines vision-LLMs, audio-LLMs, and general LLMs to generate questions and answers requiring joint audio-visual understanding.

If this is right

Omni-LLMs outperform uni-modal models on tasks needing both audio and visual input.
Accuracy drops most sharply on cross-scene reasoning items.
The benchmark covers speech, sound events, music, and vocal traits across single-, cross-, and full-scene spans.
Current models still have substantial room to improve multi-modal integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may need explicit training objectives that force cross-scene audio-visual alignment.
Automated question generation could be extended with iterative human-in-the-loop checks to strengthen reliability.
JointAVBench could become a standard yardstick for measuring progress in general video understanding systems.

Load-bearing premise

The automated pipeline produces questions and answers that truly require joint audio-visual understanding and contain no biases or answer leakage from the generation process.

What would settle it

A human review that finds many questions can be answered correctly from vision alone or audio alone would show the joint-dependency claim does not hold.
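
The check described above can be expressed as a small computation. The sketch below assumes each question has already been judged (by human reviewers or by uni-modal models) as answerable or not from vision alone and from audio alone; the function name and the dictionary keys `vision_only` and `audio_only` are hypothetical.

```python
def single_modality_solvability(results):
    """Summarize how often questions are answered correctly without joint input.

    `results` is a list of dicts with boolean fields (hypothetical names):
    'vision_only' and 'audio_only', one dict per question.
    """
    total = len(results)
    if total == 0:
        raise ValueError("results must not be empty")
    vision = sum(1 for r in results if r["vision_only"])
    audio = sum(1 for r in results if r["audio_only"])
    # A question undermines the joint-dependency claim if either modality suffices.
    either = sum(1 for r in results if r["vision_only"] or r["audio_only"])
    return {
        "vision_only_rate": vision / total,
        "audio_only_rate": audio / total,
        "either_rate": either / total,
    }
```

Correct answers from a single modality include lucky guesses, so these rates are best read against the chance level of the answer format rather than against zero.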

Figures

Figures reproduced from arXiv: 2512.12772 by Jianghan Chao, Jianzhang Gao, Liyun Ru, Ruihua Song, Wenhui Tan, Yuchong Sun.

**Figure 1.** Examples of JointAVBench. (a) asks a cross-scene plot-related question that needs the visual information in Scene 3 and the speech information in Scene 1 and Scene 23 to reason the right answer. (b) asks a single-scene emotion-related question that needs the visual information of the speaker and his vocal traits to answer.

**Figure 2.** Pipeline for JointAVBench. Our construction pipeline is three-fold: (a) Omni-modal caption generation, (b) QA pair creation, and (c) Quality control.

**Figure 3.** Statistics of JointAVBench.

3.2.3 Stage 3: Quality Control. We implement a multi-stage quality control process to address issues identified in the collected 9,109 QA pairs, such as mismatched question-answer pairs and redundant information. This process employs a general-to-specific verification strategy, where we guide models to use a chain-of-thought approach for step-by-step data filtering. General Verif…

**Figure 4.** Results on JointAVBench across different audio types.

**Figure 5.** Results on JointAVBench across different scene types.

**Figure 6.** Results on JointAVBench across 5 cognitive dimensions.

**Figure 8.** More details of JointAVBench. The number 1-5 in Figure (b) are human ratings, 1 represents …

**Figure 9.** Additional cases of JointAVBench. The first and second row represents single-scene tasks, …

**Figure 10.** Prompts for generating omni-modal caption.

**Figure 11.** Prompt for audio caption refinement.

**Figure 12.** Prompts for generating QA pairs.

**Figure 13.** Prompts for general checks.

**Figure 14.** Prompts for sequence check and ambiguity check.

**Figure 15.** Prompts for audio check.

**Figure 16.** Prompts for interval check and distractor generation.

Original abstract

Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 65.3%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

JointAVBench adds a useful benchmark for strict joint audio-visual reasoning but its automated pipeline lacks the checks needed to confirm questions truly require both modalities.

The core contribution is JointAVBench, a dataset built to test Omni-LLMs on questions that need both audio and video. It covers four audio types (speech, sound events, music, vocal traits) and three scene spans (single, cross, full), with an automated synthesis pipeline using vision-LLMs, audio-LLMs, and general LLMs to generate the items. They report that the best Omni model reaches only 65.3% average accuracy, beating uni-modal baselines but showing clear limits, especially on cross-scene cases. This design directly targets gaps in prior benchmarks around multi-modal dependency and scene variety, and the evaluation setup is straightforward to follow. The automated pipeline is a practical choice given labeling costs. The main weakness is missing validation: no uni-modal accuracy numbers on the generated questions during curation, no human agreement rates, and no error analysis on whether the pipeline introduced leakage or shortcuts. Without those, the 65.3% figure is harder to read as a clean joint-reasoning ceiling. The paper is aimed at researchers working on multi-modal video models and benchmark design. Anyone testing Omni-LLMs on AV tasks would find the coverage and numbers worth looking at. It deserves peer review because the benchmark idea is concrete and addresses a real evaluation gap, even though the pipeline details will need tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces JointAVBench, a benchmark for evaluating Omni-LLMs on joint audio-visual reasoning in videos. It targets three gaps in prior datasets: strict multi-modal dependency (questions unsolvable from vision or audio alone), coverage of four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, full-scene). An automated pipeline using vision-LLMs, audio-LLMs, and general LLMs synthesizes questions and answers; leading models are then evaluated, with the best Omni-LLM reaching 65.3% average accuracy while outperforming uni-modal baselines and showing particular weakness on cross-scene items.

Significance. If the generated questions are verifiably dependent on joint input, the benchmark would be a useful addition for measuring integrated audio-visual reasoning. The reported 65.3% ceiling on the strongest Omni-LLM, together with the explicit cross-scene breakdown, supplies a concrete, falsifiable signal of current limitations and could usefully direct future model work. The automated construction approach itself is scalable and addresses the high cost of manual annotation.

major comments (2)

§3.2 (Automated Pipeline): The description of successive prompting with vision-LLM + audio-LLM + general LLM does not include any uni-modal accuracy measurements on the generated questions during curation. Without these numbers, it is impossible to confirm that the items cannot be solved from vision or audio alone, undermining the central claim of strict multi-modal dependency.
§4 (Evaluation): No human verification statistics are reported (e.g., percentage of questions independently confirmed by humans to require joint AV input, or inter-annotator agreement). Given the fully automated generation process, such rates are necessary to bound the risk of answer leakage or modality-specific shortcuts, especially for cross-scene items whose temporal alignment is LLM-mediated.
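
For the inter-annotator agreement requested in the second comment, one common statistic is Cohen's kappa for two annotators. The sketch below is a generic implementation under the assumption that each annotator gives one categorical label per question (for example, whether the question requires joint AV input); it is not taken from the paper, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same questions."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of questions with identical labels.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most questions fall into one label.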

minor comments (2)

The abstract states that the benchmark spans 'five cognitive dimensions' but does not enumerate them; the main text should list them explicitly with one-sentence definitions to aid reproducibility.
Table 2 (or equivalent results table): Report the exact number of questions per audio type and per scene span so readers can assess balance and statistical reliability of the per-category accuracies.
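
To illustrate why per-category counts matter for statistical reliability, the sketch below reports the question count and accuracy per category together with a 95% Wilson score interval. It is a generic illustration, not the paper's evaluation code; the function names and the `(category, is_correct)` record format are assumptions.

```python
import math
from collections import defaultdict


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def per_category_report(records):
    """records: iterable of (category, is_correct) pairs (hypothetical format)."""
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, is_correct in records:
        counts[category][0] += int(is_correct)
        counts[category][1] += 1
    report = {}
    for category, (correct, total) in counts.items():
        report[category] = {
            "n": total,
            "accuracy": correct / total,
            "ci95": wilson_interval(correct, total),
        }
    return report
```

With the same accuracy, a category with few questions yields a much wider interval than a large one, which is the balance concern the comment raises.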

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will revise the paper accordingly to strengthen the validation of the benchmark.

Referee: §3.2 (Automated Pipeline): The description of successive prompting with vision-LLM + audio-LLM + general LLM does not include any uni-modal accuracy measurements on the generated questions during curation. Without these numbers, it is impossible to confirm that the items cannot be solved from vision or audio alone, undermining the central claim of strict multi-modal dependency.

Authors: We agree that explicit uni-modal accuracy measurements on the final questions are needed to rigorously support the strict multi-modal dependency claim. Although the pipeline design uses successive prompting to enforce integration across modalities, we did not report these numbers in the original submission. In the revised manuscript, we will add a dedicated analysis evaluating vision-only and audio-only models on the generated questions, quantifying the accuracy drop to confirm that items cannot be solved from a single modality.
revision: yes

Referee: §4 (Evaluation): No human verification statistics are reported (e.g., percentage of questions independently confirmed by humans to require joint AV input, or inter-annotator agreement). Given the fully automated generation process, such rates are necessary to bound the risk of answer leakage or modality-specific shortcuts, especially for cross-scene items whose temporal alignment is LLM-mediated.

Authors: We recognize that human verification statistics would provide additional assurance against leakage or shortcuts, especially for cross-scene items. The automated pipeline was introduced to address the prohibitive cost of full manual annotation, but we agree this leaves a validation gap. In the revision, we will conduct a human study on a sampled subset of questions (stratified by scene span), reporting the percentage independently confirmed to require joint AV input along with inter-annotator agreement, with focused analysis on cross-scene cases.
revision: yes
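
The stratified subset mentioned in the response could be drawn along the following lines. This is a minimal sketch under assumed names: `stratified_sample`, the `key` function, and the per-stratum size are hypothetical, and the simulated rebuttal does not specify sample sizes or a sampling procedure.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum defined by `key(item)`."""
    rng = random.Random(seed)  # fixed seed so the drawn subset is reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = {}
    for stratum in sorted(groups):  # sorted for a deterministic draw order
        members = groups[stratum]
        k = min(per_stratum, len(members))
        sample[stratum] = rng.sample(members, k)  # sampling without replacement
    return sample
```

Sampling a fixed number per scene span, rather than proportionally, keeps enough cross-scene items in the human study even if they are a minority of the benchmark.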

Circularity Check

0 steps flagged

No circularity: benchmark is empirically constructed and evaluated via direct measurements

full rationale

The paper introduces JointAVBench through an automated pipeline using external vision-LLMs, audio-LLMs, and general LLMs, then reports direct accuracy measurements (e.g., best Omni-LLM at 65.3%) on held-out questions against uni-modal baselines. No derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. The central claims reduce to empirical observations rather than any self-referential construction or renaming of inputs. Potential concerns about question leakage or modality shortcuts are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the automated synthesis pipeline produces valid, non-leaking joint-reasoning questions. No free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: State-of-the-art vision-LLMs, audio-LLMs, and general LLMs can be prompted to generate questions and answers that strictly require joint audio-visual understanding.
Invoked in the description of the automated pipeline; no independent verification details provided in abstract.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
cs.MM 2026-05 unverdicted novelty 7.0

FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.

Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
cs.CV 2026-04 unverdicted novelty 7.0

MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.

MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
cs.MM 2026-05 unverdicted novelty 6.0

MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
cs.CL 2026-04 unverdicted novelty 6.0

MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 5 Pith papers

[1]

Excuse me

achieves optimal performance in identifying potential hallucinations. During the general check, we utilize only the QA pair and its explanation to filter out unqualified QA pairs. This stage includes four checks: modality, format, content, and speculation checks. The details of each check are as follows: (i) modality check assesses whether the modality cl...

work page 2024
[2]

Only include details that are clearly visible in the video

Scene Setting: Briefly describe the environment, location, time of day, lighting, and any notable objects or elements in the background. Only include details that are clearly visible in the video

work page
[3]

Focus on their most significant movements, gestures, and interactions

Characters and Actions: Highlight the appearance, clothing, and key actions of any characters present. Focus on their most significant movements, gestures, and interactions. Do not infer emotions, intentions, or backstory unless explicitly shown through visual cues

work page
[4]

Include the sequence of events and the pacing of the scene to convey how it unfolds over time

Scene Dynamics: Describe any important changes in the scene. Include the sequence of events and the pacing of the scene to convey how it unfolds over time. Only describe what is visually evident

work page
[5]

Only describe emotions that are clearly expressed through visible actions or expressions

Emotional Tone: Convey the mood or atmosphere through the most impactful visual cues, such as facial expressions, body language, or environmental details. Only describe emotions that are clearly expressed through visible actions or expressions

work page
[6]

Only include events that are explicitly shown in the video

Key Events: Highlight any significant events or actions that occur within the scene, focusing on their narrative importance or impact on the characters. Only include events that are explicitly shown in the video. Important Notes: - Strictly factual: Ensure the description is entirely based on the visual content of the video. Do not include any information...

work page
[7]

- Non-verbal vocalizations: (e.g., laughter, sighing, coughing, crying, humming, screaming)

Sound Events (if clearly present): - Non-speech sounds: (e.g., crumpling, footsteps, door closing, glass breaking). - Non-verbal vocalizations: (e.g., laughter, sighing, coughing, crying, humming, screaming). - Characteristics (only if unambiguous): - Pitch (high/low), timbre (bright/muffled), rhythm (steady/erratic), volume (loud/soft). - Rules: - Do NOT...

work page
[8]

- Mood(e.g., cheerful, tense, melancholic)

Background Music (if clearly present): - Instruments (e.g., piano, strings, electric guitar). - Mood(e.g., cheerful, tense, melancholic). - Avoid technical terms (BPM, key, scales). - Rules: - Do NOT describe lyrics or vocal melodies. - If the music is ambiguous (e.g., genre unclear), only state observable features. Critical Constraints: - Do NOT describe...

work page
[9]

happy, sad, angry, fearful, surprised, disgusted, excited) Speaker traits: [Directly discernible characteristics like age/gender if evident from voice]

Output format for each utterance: Speech Content: [Exact dialogue content] Emotion: [Observed emotional tone] (eg. happy, sad, angry, fearful, surprised, disgusted, excited) Speaker traits: [Directly discernible characteristics like age/gender if evident from voice]

work page
[10]

Rules: (1) Only describe emotions clearly conveyed through vocal tone (2) Note speaker characteristics ONLY when immediately apparent from voice (e.g. ”child-like voice”) (3) Never add interpretations beyond what the audio contains (4) Process each utterance separately (5) Non-speech audio (music or sound only): output ”[Non-speech audio: skip analysis]” ...

work page
[11]

- Discard any speech content marked with neutral emotion (e.g., ”neutral tone”, ”neutral mood”) immediately, regardless of subtitle alignment

Comparison with Subtitle: - Directly compare the Speech Content text with the provided subtitle. - Discard any speech content marked with neutral emotion (e.g., ”neutral tone”, ”neutral mood”) immediately, regardless of subtitle alignment. - For non-neutral speech content: - If the speech content contains phrases that overlap or align with the subtitle (e...

work page
[12]

- Preserve all other emotional/tonal descriptors (e.g., ”sad mood,” ”English accent”)

Output of Emotional Information: For each utterance: - Only if the vocal traits aligns with the subtitle and is non-neutral: - Replace any speech content in the vocal traits (i.e., quoted or referenced dialogue) with the exact matching phrase from the subtitle. - Preserve all other emotional/tonal descriptors (e.g., ”sad mood,” ”English accent”). - Format...

work page
[13]

Question: A question related to the video clip

work page
[15]

Your task is to evaluate whether the question can be answered using only one modality (either video or audio) or if it requires both modalities

Explanation: An explanation supporting the answer. Your task is to evaluate whether the question can be answered using only one modality (either video or audio) or if it requires both modalities. Please strictly base your judgment on the information explicitly required to answer the question, as well as the content of the provided answer and explanation. ...

work page
[16]

Information Analysis: Analyze the question to identify the specific visual and auditory details required to answer it. Does the question require visual details (e.g., objects, actions, or settings) or auditory details (e.g., speech, sound effects, or music)? Extract the visual and the auditory information from the explanation to determine which modalities...

work page
[17]

- Determine if the question can be answered using only the extracted video text

Modality Assessment: Based on the analysis of the question and explanation, determine if the required information can be obtained entirely from one modality (either video or audio) or if both audio and visual modalities are necessary. - Determine if the question can be answered using only the extracted video text. - Determine if the question can be answer...

work page
[18]

Otherwise, output [YES]

Conclusion: Provide your final determination: Output [YES] if the question explicitly requires information from both video and audio modalities to be answered correctly, or if the answer and explanation rely on information from both modalities. Otherwise, output [YES]. Here is the question:{question} Here is the answer:{answer} Here is the explanation:{ex...

work page
[19]

Question: A question related to a given context

work page
[20]

Answer: An answer provided to the question

work page
[21]

Your task is to evaluate the quality of the question-answer pair by performing two checks: format check and content check

Explanation: An explanation supporting the answer. Your task is to evaluate the quality of the question-answer pair by performing two checks: format check and content check. Please strictly base your judgment on the information explicitly provided in the question, answer, and explanation. Avoid making assumptions beyond what is stated. Please follow these...

work page
[22]

- Check if the answer addresses all the pieces of information requested in the question

Format Check: - Analyze the question to determine how many distinct pieces of information it is asking for. - Check if the answer addresses all the pieces of information requested in the question. - If the question asks for only one piece of information and the answer fully addresses it, proceed to the content check. - If the question asks for multiple pi...

work page
[23]

- Check if the answer can be derived from the explanation and if the answer is correct based on the context of the question

Content Check: - Analyze the explanation to determine if it is reasonable and logically sound. - Check if the answer can be derived from the explanation and if the answer is correct based on the context of the question. - If the explanation is reasonable and the answer is correct and supported by the explanation, output ‘[YES]‘. - If the explanation is un...

work page
[24]

- If the explanation provides clear, evidence-based reasoning or logical steps to derive the answer, proceed to the final output

Speculation Check: - Analyze the explanation to determine if the answer relies too heavily on speculation rather than concrete evidence or logical reasoning. - If the explanation provides clear, evidence-based reasoning or logical steps to derive the answer, proceed to the final output. - If the explanation relies on assumptions, guesses, or unsupported c...

work page
[25]

Search through all video segments to find its first occurrence

work page
[26]

Record for each element: - Segment ID of first appearance - Modality type (video caption/subtitle/speech emotion)

work page
[27]

If any element cannot be found→Output ”[NO] (unverifiable element: [element name])” Stage 3: Element Validation Verify the located elements meet these criteria:

work page
[28]

Unique Segment Check: - All elements must appear in different segments - If any segment ID is shared→Output ”[NO] (co-occurring elements: [element1] & [element2] in segment X)”

work page
[29]

Modality Diversity Check: - Elements must come from≥2 different modalities - If all same modality→Output ”[NO] (single modality: [modality type])” Stage 4: Order Verification

work page
[30]

Sort elements by their first appearance segment ID (ascending)

work page
[31]

Prompt for ambiguity check You are a QA evaluation assistant tasked with filtering incorrect or low-quality question-answer pairs based on video and audio context

Compare against provided answer: - If orders match→Output ”[YES]” - If orders differ→Output ”[Corrected]” with proper order and explanation Output Format: [Validating]¡4-stage analysis¿ [Output]: [YES/NO/Corrected] [Corrected: (a) (b) (c)](if applicable) [Explanation](if Corrected): - (a) [element1]: first appears in segment [X] ([modality]) - (b) [elemen...

work page
[32]

- Check if this music information appears in the provided ‘Music Content‘

Phase 1: Music Information Validation - Extract the music-related information used in the ‘Answer‘ and ‘Explanation‘ of the QA pair. - Check if this music information appears in the provided ‘Music Content‘. - If the music information is not found in the ‘Music Content‘, output ‘[NO]‘ (invalid QA pair)

work page
[33]

Phase 2: Visual Information Cross-Check - If the music information is valid (from Phase 1), analyze whether the emotion/atmosphere described in the music can also be inferred from the ‘Video Caption‘ (e.g., character expressions, scene mood, or events). - Example: If the music mentions a ”sad atmosphere” and the video shows ”a character crying,” the music...

work page
[34]

- Otherwise, output ‘[NO]‘

Phase 3: Final Judgment - If the QA pair passes both Phase 1 and Phase 2 (i.e., music info is valid and cannot be inferred visually), output ‘[YES]‘. - Otherwise, output ‘[NO]‘. Here is the provided information:{segments info} - Question:{question} - Answer:{answer} - Explanation:{explanation} Perform the three-phase analysis and output either ’[YES]’ or ...

work page
[35]

A question-answer pair

work page
[36]

An explanation of how the answer is derived

work page
[37]

Complete information about all movie segments (timestamps, video descriptions, audio descriptions, and subtitles) Instructions:

work page
[38]

Carefully analyze the question and answer explanation to understand what information is required

work page
[39]

Examine all movie segments sequentially to locate where the relevant information begins

work page
[40]

Determine where the last necessary piece of information appears

work page
[41]

Select the earliest segment where required information starts (start segment) and the latest segment where required information ends (end segment) Ensure:

work page
[42]

The selected segments form a continuous sequence

work page
[43]

The sequence is not from the very first to the very last segment

work page
[44]

All information needed to answer the question is contained within this sequence

work page
[45]

Make sure that the generated output follow output format

The sequence is as compact as possible Output Format: Provide your response in this exact format: [Start]: ¡segment number¿[End]: ¡segment number¿[Rationale]: ¡brief explanation of why these segments were chosen¿ Important Notes: - If the answer requires information that only appears in disjoint segments, select the smallest continuous sequence that conta...

work page
[46]

**Selective Modification**: Alter specific elements such as character actions, dialogue, objects, or settings to create plausible yet incorrect options

work page
[47]

**Maintain Plausibility**: Ensure each distractor could feasibly occur within the context of the video, making them appear credible based on the visual and audio cues

work page
[48]

- **Dialogue Adjustments**: Propose believable alterations to dialogue or audio cues that didn’t actually occur

**Incorporate Diverse Misdirections**: - **Action Confusion**: Modify or swap character actions or events in ways that fit the context but are incorrect. - **Dialogue Adjustments**: Propose believable alterations to dialogue or audio cues that didn’t actually occur. - **Object or Setting Misdirection**: Suggest plausible but incorrect details about object...

work page
[49]

**Incorporate Partial Truths**: Use true audio-visual details or partial truths within the distractors to add complexity, ensuring these elements do not directly answer the question but make the distractors more compelling

work page
[50]

**Avoid Obvious Falsities**: Shift the context or details significantly without creating options that are blatantly wrong or unrelated to the video

work page
[51]

Requirements for Distractors:

**Ensure Distinct Incorrectness**: Craft distractors that will be clearly identifiable as incorrect by someone who has closely watched and listened to the video, challenging their attention to detail. Requirements for Distractors:

work page
[52]

Plausibility: Each distractor should seem correct at first glance, matching the tone and structure of the correct answer

[53]

Variety: Errors should vary (e.g., minor inaccuracies, flipped terms, oversimplifications, or common misconceptions)

[54]

Consistency: Maintain the same verb tense, technicality, and formatting as the correct answer. Format the output as follows: [Distractor 1]<Incorrect but plausible option> [Distractor 2]<Incorrect but plausible option> [Distractor 3]<Incorrect but plausible option> Provided Information: Background information:{segments info} Question:{question} Correct An...
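
A response in the `[Distractor N]` format can be split back into individual options before a multiple-choice item is assembled. The sketch below is illustrative only: `parse_distractors` and its sanity checks (exactly three non-empty, mutually distinct options, none identical to the correct answer) are assumptions, not a description of the authors' pipeline.

```python
import re
from typing import List

# Captures the text after each "[Distractor N]" tag up to the next tag or end of string.
_DISTRACTOR_RE = re.compile(
    r"\[Distractor\s*(\d+)\]\s*(.*?)\s*(?=\[Distractor\s*\d+\]|\Z)",
    re.DOTALL,
)


def parse_distractors(response: str, correct_answer: str, expected: int = 3) -> List[str]:
    """Return the distractors in tag order, or raise ValueError if the set is unusable."""
    found = {int(idx): text.strip() for idx, text in _DISTRACTOR_RE.findall(response)}
    distractors = [found[idx] for idx in sorted(found)]
    if len(distractors) != expected:
        raise ValueError(f"expected {expected} distractors, got {len(distractors)}")
    if any(not d for d in distractors):
        raise ValueError("empty distractor")
    normalized = {d.casefold() for d in distractors}
    if len(normalized) != expected:
        raise ValueError("duplicate distractors")
    if correct_answer.strip().casefold() in normalized:
        raise ValueError("a distractor repeats the correct answer")
    return distractors
```

Plausibility, variety, and consistency cannot be verified by string matching, so those requirements would still need a model-based or human check.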

[1] [1]

Excuse me

achieves optimal performance in identifying potential hallucinations. During the general check, we utilize only the QA pair and its explanation to filter out unqualified QA pairs. This stage includes four checks: modality, format, content, and speculation checks. The details of each check are as follows: (i) modality check assesses whether the modality cl...

[2] [2]

Only include details that are clearly visible in the video

Scene Setting: Briefly describe the environment, location, time of day, lighting, and any notable objects or elements in the background. Only include details that are clearly visible in the video

[3] [3]

Focus on their most significant movements, gestures, and interactions

Characters and Actions: Highlight the appearance, clothing, and key actions of any characters present. Focus on their most significant movements, gestures, and interactions. Do not infer emotions, intentions, or backstory unless explicitly shown through visual cues

[4] [4]

Include the sequence of events and the pacing of the scene to convey how it unfolds over time

Scene Dynamics: Describe any important changes in the scene. Include the sequence of events and the pacing of the scene to convey how it unfolds over time. Only describe what is visually evident

[5] [5]

Only describe emotions that are clearly expressed through visible actions or expressions

Emotional Tone: Convey the mood or atmosphere through the most impactful visual cues, such as facial expressions, body language, or environmental details. Only describe emotions that are clearly expressed through visible actions or expressions

[6] [6]

Only include events that are explicitly shown in the video

Key Events: Highlight any significant events or actions that occur within the scene, focusing on their narrative importance or impact on the characters. Only include events that are explicitly shown in the video. Important Notes: - Strictly factual: Ensure the description is entirely based on the visual content of the video. Do not include any information...

[7] [7]

- Non-verbal vocalizations: (e.g., laughter, sighing, coughing, crying, humming, screaming)

Sound Events (if clearly present): - Non-speech sounds: (e.g., crumpling, footsteps, door closing, glass breaking). - Non-verbal vocalizations: (e.g., laughter, sighing, coughing, crying, humming, screaming). - Characteristics (only if unambiguous): - Pitch (high/low), timbre (bright/muffled), rhythm (steady/erratic), volume (loud/soft). - Rules: - Do NOT...

[8] [8]

- Mood (e.g., cheerful, tense, melancholic)

Background Music (if clearly present): - Instruments (e.g., piano, strings, electric guitar). - Mood (e.g., cheerful, tense, melancholic). - Avoid technical terms (BPM, key, scales). - Rules: - Do NOT describe lyrics or vocal melodies. - If the music is ambiguous (e.g., genre unclear), only state observable features. Critical Constraints: - Do NOT describe...

[9] [9]

happy, sad, angry, fearful, surprised, disgusted, excited) Speaker traits: [Directly discernible characteristics like age/gender if evident from voice]

Output format for each utterance: Speech Content: [Exact dialogue content] Emotion: [Observed emotional tone] (e.g., happy, sad, angry, fearful, surprised, disgusted, excited) Speaker traits: [Directly discernible characteristics like age/gender if evident from voice]

[10] [10]

Rules: (1) Only describe emotions clearly conveyed through vocal tone (2) Note speaker characteristics ONLY when immediately apparent from voice (e.g., “child-like voice”) (3) Never add interpretations beyond what the audio contains (4) Process each utterance separately (5) Non-speech audio (music or sound only): output “[Non-speech audio: skip analysis]” ...

[11] [11]

- Discard any speech content marked with neutral emotion (e.g., “neutral tone”, “neutral mood”) immediately, regardless of subtitle alignment

Comparison with Subtitle: - Directly compare the Speech Content text with the provided subtitle. - Discard any speech content marked with neutral emotion (e.g., “neutral tone”, “neutral mood”) immediately, regardless of subtitle alignment. - For non-neutral speech content: - If the speech content contains phrases that overlap or align with the subtitle (e...

[12] [12]

- Preserve all other emotional/tonal descriptors (e.g., “sad mood,” “English accent”)

Output of Emotional Information: For each utterance: - Only if the vocal traits align with the subtitle and are non-neutral: - Replace any speech content in the vocal traits (i.e., quoted or referenced dialogue) with the exact matching phrase from the subtitle. - Preserve all other emotional/tonal descriptors (e.g., “sad mood,” “English accent”). - Format...

[13] [13]

Question: A question related to the video clip

[14] [15]

Your task is to evaluate whether the question can be answered using only one modality (either video or audio) or if it requires both modalities

Explanation: An explanation supporting the answer. Your task is to evaluate whether the question can be answered using only one modality (either video or audio) or if it requires both modalities. Please strictly base your judgment on the information explicitly required to answer the question, as well as the content of the provided answer and explanation. ...

[15] [16]

Information Analysis: Analyze the question to identify the specific visual and auditory details required to answer it. Does the question require visual details (e.g., objects, actions, or settings) or auditory details (e.g., speech, sound effects, or music)? Extract the visual and the auditory information from the explanation to determine which modalities...

[16] [17]

- Determine if the question can be answered using only the extracted video text

Modality Assessment: Based on the analysis of the question and explanation, determine if the required information can be obtained entirely from one modality (either video or audio) or if both audio and visual modalities are necessary. - Determine if the question can be answered using only the extracted video text. - Determine if the question can be answer...

[17] [18]

Otherwise, output [NO]

Conclusion: Provide your final determination: Output [YES] if the question explicitly requires information from both video and audio modalities to be answered correctly, or if the answer and explanation rely on information from both modalities. Otherwise, output [NO]. Here is the question:{question} Here is the answer:{answer} Here is the explanation:{ex...

[18] [19]

Question: A question related to a given context

[19] [20]

Answer: An answer provided to the question

[20] [21]

Your task is to evaluate the quality of the question-answer pair by performing two checks: format check and content check

Explanation: An explanation supporting the answer. Your task is to evaluate the quality of the question-answer pair by performing two checks: format check and content check. Please strictly base your judgment on the information explicitly provided in the question, answer, and explanation. Avoid making assumptions beyond what is stated. Please follow these...

[21] [22]

- Check if the answer addresses all the pieces of information requested in the question

Format Check: - Analyze the question to determine how many distinct pieces of information it is asking for. - Check if the answer addresses all the pieces of information requested in the question. - If the question asks for only one piece of information and the answer fully addresses it, proceed to the content check. - If the question asks for multiple pi...

[22] [23]

- Check if the answer can be derived from the explanation and if the answer is correct based on the context of the question

Content Check: - Analyze the explanation to determine if it is reasonable and logically sound. - Check if the answer can be derived from the explanation and if the answer is correct based on the context of the question. - If the explanation is reasonable and the answer is correct and supported by the explanation, output `[YES]`. - If the explanation is un...

[23] [24]

- If the explanation provides clear, evidence-based reasoning or logical steps to derive the answer, proceed to the final output

Speculation Check: - Analyze the explanation to determine if the answer relies too heavily on speculation rather than concrete evidence or logical reasoning. - If the explanation provides clear, evidence-based reasoning or logical steps to derive the answer, proceed to the final output. - If the explanation relies on assumptions, guesses, or unsupported c...
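
The general check (modality, format, content, and speculation) can be read as a short-circuiting filter over `[YES]`/`[NO]` verdicts. The sketch below is written under several assumptions that the prompts do not confirm: each check is issued as a separate call to a hypothetical `judge(check_name, qa)` callable, the last bracketed tag in a response is treated as the final verdict, and a missing tag counts as a failure.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Order of the four general checks named in the text.
CHECKS = ("modality", "format", "content", "speculation")
_VERDICT_RE = re.compile(r"\[(YES|NO)\]")


def parse_verdict(response: str) -> Optional[bool]:
    """Return True/False for the last [YES]/[NO] tag, or None if no tag is present."""
    tags = _VERDICT_RE.findall(response)
    if not tags:
        return None
    return tags[-1] == "YES"


def general_check(
    qa: Dict[str, str],
    judge: Callable[[str, Dict[str, str]], str],
) -> Tuple[bool, Optional[str]]:
    """Run the checks in order; return (passed, name_of_first_failed_check)."""
    for name in CHECKS:
        if parse_verdict(judge(name, qa)) is not True:
            return False, name
    return True, None
```

Returning the name of the first failed check makes it possible to tally which filter removes the most QA pairs.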

[24] [25]

Search through all video segments to find its first occurrence

[25] [26]

Record for each element: - Segment ID of first appearance - Modality type (video caption/subtitle/speech emotion)

[26] [27]

If any element cannot be found → Output “[NO] (unverifiable element: [element name])” Stage 3: Element Validation Verify the located elements meet these criteria:

[27] [28]

Unique Segment Check: - All elements must appear in different segments - If any segment ID is shared → Output “[NO] (co-occurring elements: [element1] & [element2] in segment X)”

[28] [29]

Modality Diversity Check: - Elements must come from ≥2 different modalities - If all same modality → Output “[NO] (single modality: [modality type])” Stage 4: Order Verification

[29] [30]

Sort elements by their first appearance segment ID (ascending)

[30] [31]

Prompt for ambiguity check You are a QA evaluation assistant tasked with filtering incorrect or low-quality question-answer pairs based on video and audio context

Compare against provided answer: - If orders match → Output “[YES]” - If orders differ → Output “[Corrected]” with proper order and explanation Output Format: [Validating]<4-stage analysis> [Output]: [YES/NO/Corrected] [Corrected: (a) (b) (c)](if applicable) [Explanation](if Corrected): - (a) [element1]: first appears in segment [X] ([modality]) - (b) [elemen...
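
The validation and order-verification stages amount to a small deterministic procedure once each element's first appearance is known. The following is a rough sketch under stated assumptions: `verify_order` and its `first_seen` mapping (element name to segment ID and modality) are hypothetical names, and locating the elements in the video segments is assumed to have been done already.

```python
from typing import Dict, List, Tuple


def verify_order(
    first_seen: Dict[str, Tuple[int, str]],
    proposed: List[str],
) -> Tuple[str, List[str]]:
    """Check a proposed element ordering against first-appearance segment IDs.

    first_seen maps element -> (segment_id, modality); elements that could not
    be located are simply absent from the mapping.
    """
    if not proposed:
        raise ValueError("proposed ordering is empty")
    # Every element must have been located.
    for name in proposed:
        if name not in first_seen:
            return f"[NO] (unverifiable element: {name})", []
    # Unique segment check: no two elements may share a segment.
    owner: Dict[int, str] = {}
    for name in proposed:
        segment = first_seen[name][0]
        if segment in owner:
            return (
                f"[NO] (co-occurring elements: {owner[segment]} & {name} in segment {segment})",
                [],
            )
        owner[segment] = name
    # Modality diversity check: at least two distinct modalities.
    modalities = {first_seen[name][1] for name in proposed}
    if len(modalities) < 2:
        return f"[NO] (single modality: {next(iter(modalities))})", []
    # Order verification: sort by first-appearance segment ID (ascending).
    ordered = sorted(proposed, key=lambda name: first_seen[name][0])
    return ("[YES]" if ordered == proposed else "[Corrected]"), ordered
```

For example, with `first_seen = {"apology": (1, "subtitle"), "door slam": (3, "video caption")}`, the proposed order `["door slam", "apology"]` yields `"[Corrected]"` together with the ascending order.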

[31] [32]

- Check if this music information appears in the provided `Music Content`

Phase 1: Music Information Validation - Extract the music-related information used in the `Answer` and `Explanation` of the QA pair. - Check if this music information appears in the provided `Music Content`. - If the music information is not found in the `Music Content`, output `[NO]` (invalid QA pair)

[32] [33]

Phase 2: Visual Information Cross-Check - If the music information is valid (from Phase 1), analyze whether the emotion/atmosphere described in the music can also be inferred from the `Video Caption` (e.g., character expressions, scene mood, or events). - Example: If the music mentions a “sad atmosphere” and the video shows “a character crying,” the music...

[33] [34]

- Otherwise, output `[NO]`

Phase 3: Final Judgment - If the QA pair passes both Phase 1 and Phase 2 (i.e., music info is valid and cannot be inferred visually), output `[YES]`. - Otherwise, output `[NO]`. Here is the provided information:{segments info} - Question:{question} - Answer:{answer} - Explanation:{explanation} Perform the three-phase analysis and output either ’[YES]’ or ...

[34] [35]

A question-answer pair

[35] [36]

An explanation of how the answer is derived

[36] [37]

Complete information about all movie segments (timestamps, video descriptions, audio descriptions, and subtitles) Instructions:

[37] [38]

Carefully analyze the question and answer explanation to understand what information is required

[38] [39]

Examine all movie segments sequentially to locate where the relevant information begins

[39] [40]

Determine where the last necessary piece of information appears

[40] [41]

Select the earliest segment where required information starts (start segment) and the latest segment where required information ends (end segment) Ensure:

[41] [42]

The selected segments form a continuous sequence

[42] [43]

The sequence is not from the very first to the very last segment
