JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation
Pith reviewed 2026-05-16 22:29 UTC · model grok-4.3
The pith
Even the best Omni-LLMs reach only 65.3 percent average accuracy on a benchmark that demands strict joint audio-visual reasoning in videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JointAVBench is a new benchmark built around strict audio-video correlation and designed to evaluate Omni-LLMs across five cognitive dimensions, four audio information types, and three scene spans. An automated synthesis pipeline produces questions and answers that require joint understanding. Evaluation results show that the strongest Omni-LLM attains an average accuracy of 65.3 percent, outperforming vision-only and audio-only baselines while leaving substantial headroom for improvement, particularly on cross-scene tasks.
What carries the argument
Automated synthesis pipeline that combines vision-LLMs, audio-LLMs, and general LLMs to generate questions and answers requiring joint audio-visual understanding.
If this is right
- Omni-LLMs outperform uni-modal models on tasks needing both audio and visual input.
- Accuracy drops most sharply on cross-scene reasoning items.
- The benchmark covers speech, sound events, music, and vocal traits across single-, cross-, and full-scene spans.
- Current models still have substantial room to improve multi-modal integration.
Where Pith is reading between the lines
- Models may need explicit training objectives that force cross-scene audio-visual alignment.
- Automated question generation could be extended with iterative human-in-the-loop checks to strengthen reliability.
- JointAVBench could become a standard yardstick for measuring progress in general video understanding systems.
Load-bearing premise
The automated pipeline produces questions and answers that truly require joint audio-visual understanding and contain no biases or answer leakage from the generation process.
What would settle it
A human review that finds many questions can be answered correctly from vision alone or audio alone would show the joint-dependency claim does not hold.
Original abstract
Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 65.3%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JointAVBench, a benchmark for evaluating Omni-LLMs on joint audio-visual reasoning in videos. It targets three gaps in prior datasets: strict multi-modal dependency (questions unsolvable from vision or audio alone), coverage of four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, full-scene). An automated pipeline using vision-LLMs, audio-LLMs, and general LLMs synthesizes questions and answers; leading models are then evaluated, with the best Omni-LLM reaching 65.3% average accuracy while outperforming uni-modal baselines and showing particular weakness on cross-scene items.
Significance. If the generated questions are verifiably dependent on joint input, the benchmark would be a useful addition for measuring integrated audio-visual reasoning. The reported 65.3% ceiling on the strongest Omni-LLM, together with the explicit cross-scene breakdown, supplies a concrete, falsifiable signal of current limitations and could usefully direct future model work. The automated construction approach itself is scalable and addresses the high cost of manual annotation.
major comments (2)
- [§3.2] §3.2 (Automated Pipeline): The description of successive prompting with vision-LLM + audio-LLM + general LLM does not include any uni-modal accuracy measurements on the generated questions during curation. Without these numbers, it is impossible to confirm that the items cannot be solved from vision or audio alone, undermining the central claim of strict multi-modal dependency.
- [§4] §4 (Evaluation): No human verification statistics are reported (e.g., percentage of questions independently confirmed by humans to require joint AV input, or inter-annotator agreement). Given the fully automated generation process, such rates are necessary to bound the risk of answer leakage or modality-specific shortcuts, especially for cross-scene items whose temporal alignment is LLM-mediated.
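The uni-modal measurement requested in the first major comment could be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the record fields (`vision_only`, `audio_only`, `omni`, `scene_span`) and the toy data are hypothetical, and real predictions would come from running each model on the benchmark.

```python
from collections import defaultdict

def accuracy_by_group(records, model_key, group_key):
    """Return {group: accuracy} for one model's predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r[group_key]
        total[group] += 1
        correct[group] += int(r[model_key] == r["answer"])
    return {g: correct[g] / total[g] for g in total}

def single_modality_solvable(records, unimodal_keys):
    """IDs of items that at least one single-modality model answers correctly."""
    return [r["id"] for r in records
            if any(r[k] == r["answer"] for k in unimodal_keys)]

# Hypothetical toy records; field names are assumptions for illustration only.
records = [
    {"id": 1, "scene_span": "single", "answer": "A",
     "vision_only": "A", "audio_only": "C", "omni": "A"},
    {"id": 2, "scene_span": "single", "answer": "B",
     "vision_only": "D", "audio_only": "C", "omni": "B"},
    {"id": 3, "scene_span": "cross", "answer": "C",
     "vision_only": "A", "audio_only": "B", "omni": "D"},
    {"id": 4, "scene_span": "cross", "answer": "D",
     "vision_only": "B", "audio_only": "D", "omni": "D"},
]

vision_acc = accuracy_by_group(records, "vision_only", "scene_span")
audio_acc = accuracy_by_group(records, "audio_only", "scene_span")
omni_acc = accuracy_by_group(records, "omni", "scene_span")
flagged = single_modality_solvable(records, ["vision_only", "audio_only"])
```

A correct single-modality answer can still be a lucky guess on a multiple-choice item, so flagged items are candidates for manual inspection rather than proof of leakage.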
minor comments (2)
- The abstract states that the benchmark spans 'five cognitive dimensions' but does not enumerate them; the main text should list them explicitly with one-sentence definitions to aid reproducibility.
- [Table 2] Table 2 (or equivalent results table): report the exact number of questions per audio type and per scene span so readers can assess balance and statistical reliability of the per-category accuracies.
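To illustrate why per-category question counts matter for the second minor comment, the sketch below computes a Wilson score interval for an accuracy estimate. The counts used are hypothetical and are not taken from the paper; the function is the standard Wilson formula with z = 1.96 for an approximate 95% interval.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (approx. 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical category sizes: same observed accuracy, different question counts.
small_lo, small_hi = wilson_interval(50, 100)
large_lo, large_hi = wilson_interval(500, 1000)
```

With the same observed accuracy, the category with ten times as many questions has a much narrower interval, which is why per-category accuracies are hard to interpret without the counts.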
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will revise the paper accordingly to strengthen the validation of the benchmark.
- Referee: [§3.2] §3.2 (Automated Pipeline): The description of successive prompting with vision-LLM + audio-LLM + general LLM does not include any uni-modal accuracy measurements on the generated questions during curation. Without these numbers, it is impossible to confirm that the items cannot be solved from vision or audio alone, undermining the central claim of strict multi-modal dependency.
  Authors: We agree that explicit uni-modal accuracy measurements on the final questions are needed to rigorously support the strict multi-modal dependency claim. Although the pipeline design uses successive prompting to enforce integration across modalities, we did not report these numbers in the original submission. In the revised manuscript, we will add a dedicated analysis evaluating vision-only and audio-only models on the generated questions, quantifying the accuracy drop to confirm that items cannot be solved from a single modality.
  Revision: yes
- Referee: [§4] §4 (Evaluation): No human verification statistics are reported (e.g., percentage of questions independently confirmed by humans to require joint AV input, or inter-annotator agreement). Given the fully automated generation process, such rates are necessary to bound the risk of answer leakage or modality-specific shortcuts, especially for cross-scene items whose temporal alignment is LLM-mediated.
  Authors: We recognize that human verification statistics would provide additional assurance against leakage or shortcuts, especially for cross-scene items. The automated pipeline was introduced to address the prohibitive cost of full manual annotation, but we agree this leaves a validation gap. In the revision, we will conduct a human study on a sampled subset of questions (stratified by scene span), reporting the percentage independently confirmed to require joint AV input along with inter-annotator agreement, with focused analysis on cross-scene cases.
  Revision: yes
Circularity Check
No circularity: benchmark is empirically constructed and evaluated via direct measurements
The paper introduces JointAVBench through an automated pipeline using external vision-LLMs, audio-LLMs, and general LLMs, then reports direct accuracy measurements (e.g., best Omni-LLM at 65.3%) on held-out questions against uni-modal baselines. No derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. The central claims reduce to empirical observations rather than any self-referential construction or renaming of inputs. Potential concerns about question leakage or modality shortcuts are validity issues, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption State-of-the-art vision-LLMs, audio-LLMs, and general LLMs can be prompted to generate questions and answers that strictly require joint audio-visual understanding.
Forward citations
Cited by 5 Pith papers
- TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
  TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
- FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
  FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
- Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
  MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
- MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
  MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
- MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
  MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
Reference graph
Works this paper leans on
-
[1]
achieves optimal performance in identifying potential hallucinations. During the general check, we utilize only the QA pair and its explanation to filter out unqualified QA pairs. This stage includes four checks: modality, format, content, and speculation checks. The details of each check are as follows: (i) modality check assesses whether the modality cl...
-
[2]
Only include details that are clearly visible in the video
Scene Setting: Briefly describe the environment, location, time of day, lighting, and any notable objects or elements in the background. Only include details that are clearly visible in the video
-
[3]
Focus on their most significant movements, gestures, and interactions
Characters and Actions: Highlight the appearance, clothing, and key actions of any characters present. Focus on their most significant movements, gestures, and interactions. Do not infer emotions, intentions, or backstory unless explicitly shown through visual cues
-
[4]
Include the sequence of events and the pacing of the scene to convey how it unfolds over time
Scene Dynamics: Describe any important changes in the scene. Include the sequence of events and the pacing of the scene to convey how it unfolds over time. Only describe what is visually evident
-
[5]
Only describe emotions that are clearly expressed through visible actions or expressions
Emotional Tone: Convey the mood or atmosphere through the most impactful visual cues, such as facial expressions, body language, or environmental details. Only describe emotions that are clearly expressed through visible actions or expressions
-
[6]
Only include events that are explicitly shown in the video
Key Events: Highlight any significant events or actions that occur within the scene, focusing on their narrative importance or impact on the characters. Only include events that are explicitly shown in the video. Important Notes: - Strictly factual: Ensure the description is entirely based on the visual content of the video. Do not include any information...
-
[7]
- Non-verbal vocalizations: (e.g., laughter, sighing, coughing, crying, humming, screaming)
Sound Events (if clearly present): - Non-speech sounds: (e.g., crumpling, footsteps, door closing, glass breaking). - Non-verbal vocalizations: (e.g., laughter, sighing, coughing, crying, humming, screaming). - Characteristics (only if unambiguous): - Pitch (high/low), timbre (bright/muffled), rhythm (steady/erratic), volume (loud/soft). - Rules: - Do NOT...
-
[8]
- Mood (e.g., cheerful, tense, melancholic)
Background Music (if clearly present): - Instruments (e.g., piano, strings, electric guitar). - Mood (e.g., cheerful, tense, melancholic). - Avoid technical terms (BPM, key, scales). - Rules: - Do NOT describe lyrics or vocal melodies. - If the music is ambiguous (e.g., genre unclear), only state observable features. Critical Constraints: - Do NOT describe...
-
[9]
Output format for each utterance: Speech Content: [Exact dialogue content] Emotion: [Observed emotional tone] (e.g., happy, sad, angry, fearful, surprised, disgusted, excited) Speaker traits: [Directly discernible characteristics like age/gender if evident from voice]
-
[10]
Rules: (1) Only describe emotions clearly conveyed through vocal tone (2) Note speaker characteristics ONLY when immediately apparent from voice (e.g., "child-like voice") (3) Never add interpretations beyond what the audio contains (4) Process each utterance separately (5) Non-speech audio (music or sound only): output "[Non-speech audio: skip analysis]" ...
-
[11]
Comparison with Subtitle: - Directly compare the Speech Content text with the provided subtitle. - Discard any speech content marked with neutral emotion (e.g., "neutral tone", "neutral mood") immediately, regardless of subtitle alignment. - For non-neutral speech content: - If the speech content contains phrases that overlap or align with the subtitle (e...
-
[12]
- Preserve all other emotional/tonal descriptors (e.g., "sad mood," "English accent")
Output of Emotional Information: For each utterance: - Only if the vocal traits align with the subtitle and are non-neutral: - Replace any speech content in the vocal traits (i.e., quoted or referenced dialogue) with the exact matching phrase from the subtitle. - Preserve all other emotional/tonal descriptors (e.g., "sad mood," "English accent"). - Format...
-
[13]
Question: A question related to the video clip
-
[15]
Explanation: An explanation supporting the answer. Your task is to evaluate whether the question can be answered using only one modality (either video or audio) or if it requires both modalities. Please strictly base your judgment on the information explicitly required to answer the question, as well as the content of the provided answer and explanation. ...
-
[16]
Information Analysis: Analyze the question to identify the specific visual and auditory details required to answer it. Does the question require visual details (e.g., objects, actions, or settings) or auditory details (e.g., speech, sound effects, or music)? Extract the visual and the auditory information from the explanation to determine which modalities...
-
[17]
- Determine if the question can be answered using only the extracted video text
Modality Assessment: Based on the analysis of the question and explanation, determine if the required information can be obtained entirely from one modality (either video or audio) or if both audio and visual modalities are necessary. - Determine if the question can be answered using only the extracted video text. - Determine if the question can be answer...
-
[18]
Conclusion: Provide your final determination: Output [YES] if the question explicitly requires information from both video and audio modalities to be answered correctly, or if the answer and explanation rely on information from both modalities. Otherwise, output [NO]. Here is the question: {question} Here is the answer: {answer} Here is the explanation: {ex...
-
[19]
Question: A question related to a given context
-
[20]
Answer: An answer provided to the question
-
[21]
Explanation: An explanation supporting the answer. Your task is to evaluate the quality of the question-answer pair by performing two checks: format check and content check. Please strictly base your judgment on the information explicitly provided in the question, answer, and explanation. Avoid making assumptions beyond what is stated. Please follow these...
-
[22]
- Check if the answer addresses all the pieces of information requested in the question
Format Check: - Analyze the question to determine how many distinct pieces of information it is asking for. - Check if the answer addresses all the pieces of information requested in the question. - If the question asks for only one piece of information and the answer fully addresses it, proceed to the content check. - If the question asks for multiple pi...
-
[23]
Content Check: - Analyze the explanation to determine if it is reasonable and logically sound. - Check if the answer can be derived from the explanation and if the answer is correct based on the context of the question. - If the explanation is reasonable and the answer is correct and supported by the explanation, output '[YES]'. - If the explanation is un...
-
[24]
Speculation Check: - Analyze the explanation to determine if the answer relies too heavily on speculation rather than concrete evidence or logical reasoning. - If the explanation provides clear, evidence-based reasoning or logical steps to derive the answer, proceed to the final output. - If the explanation relies on assumptions, guesses, or unsupported c...
-
[25]
Search through all video segments to find its first occurrence
-
[26]
Record for each element: - Segment ID of first appearance - Modality type (video caption/subtitle/speech emotion)
-
[27]
If any element cannot be found → Output "[NO] (unverifiable element: [element name])" Stage 3: Element Validation Verify the located elements meet these criteria:
-
[28]
Unique Segment Check: - All elements must appear in different segments - If any segment ID is shared → Output "[NO] (co-occurring elements: [element1] & [element2] in segment X)"
-
[29]
Modality Diversity Check: - Elements must come from ≥2 different modalities - If all same modality → Output "[NO] (single modality: [modality type])" Stage 4: Order Verification
-
[30]
Sort elements by their first appearance segment ID (ascending)
-
[31]
Compare against provided answer: - If orders match → Output "[YES]" - If orders differ → Output "[Corrected]" with proper order and explanation Output Format: [Validating] <4-stage analysis> [Output]: [YES/NO/Corrected] [Corrected: (a) (b) (c)] (if applicable) [Explanation] (if Corrected): - (a) [element1]: first appears in segment [X] ([modality]) - (b) [elemen...
-
[32]
- Check if this music information appears in the provided 'Music Content'
Phase 1: Music Information Validation - Extract the music-related information used in the 'Answer' and 'Explanation' of the QA pair. - Check if this music information appears in the provided 'Music Content'. - If the music information is not found in the 'Music Content', output '[NO]' (invalid QA pair)
-
[33]
Phase 2: Visual Information Cross-Check - If the music information is valid (from Phase 1), analyze whether the emotion/atmosphere described in the music can also be inferred from the 'Video Caption' (e.g., character expressions, scene mood, or events). - Example: If the music mentions a "sad atmosphere" and the video shows "a character crying," the music...
-
[34]
Phase 3: Final Judgment - If the QA pair passes both Phase 1 and Phase 2 (i.e., music info is valid and cannot be inferred visually), output '[YES]'. - Otherwise, output '[NO]'. Here is the provided information: {segments info} - Question: {question} - Answer: {answer} - Explanation: {explanation} Perform the three-phase analysis and output either '[YES]' or ...
-
[35]
A question-answer pair
-
[36]
An explanation of how the answer is derived
-
[37]
Complete information about all movie segments (timestamps, video descriptions, audio descriptions, and subtitles) Instructions:
-
[38]
Carefully analyze the question and answer explanation to understand what information is required
-
[39]
Examine all movie segments sequentially to locate where the relevant information begins
-
[40]
Determine where the last necessary piece of information appears
-
[41]
Select the earliest segment where required information starts (start segment) and the latest segment where required information ends (end segment) Ensure:
-
[42]
The selected segments form a continuous sequence
-
[43]
The sequence is not from the very first to the very last segment
-
[44]
All information needed to answer the question is contained within this sequence
-
[45]
Make sure that the generated output follows the output format
The sequence is as compact as possible Output Format: Provide your response in this exact format: [Start]: <segment number> [End]: <segment number> [Rationale]: <brief explanation of why these segments were chosen> Important Notes: - If the answer requires information that only appears in disjoint segments, select the smallest continuous sequence that conta...
-
[46]
**Selective Modification**: Alter specific elements such as character actions, dialogue, objects, or settings to create plausible yet incorrect options
-
[47]
**Maintain Plausibility**: Ensure each distractor could feasibly occur within the context of the video, making them appear credible based on the visual and audio cues
-
[48]
**Incorporate Diverse Misdirections**: - **Action Confusion**: Modify or swap character actions or events in ways that fit the context but are incorrect. - **Dialogue Adjustments**: Propose believable alterations to dialogue or audio cues that didn’t actually occur. - **Object or Setting Misdirection**: Suggest plausible but incorrect details about object...
-
[49]
**Incorporate Partial Truths**: Use true audio-visual details or partial truths within the distractors to add complexity, ensuring these elements do not directly answer the question but make the distractors more compelling
-
[50]
**Avoid Obvious Falsities**: Shift the context or details significantly without creating options that are blatantly wrong or unrelated to the video
-
[51]
**Ensure Distinct Incorrectness**: Craft distractors that will be clearly identifiable as incorrect by someone who has closely watched and listened to the video, challenging their attention to detail. Requirements for Distractors:
-
[52]
Plausibility: Each distractor should seem correct at first glance, matching the tone and structure of the correct answer
-
[53]
Variety: Errors should vary (e.g., minor inaccuracies, flipped terms, oversimplifications, or common misconceptions)
-
[54]
Consistency: Maintain the same verb tense, technicality, and formatting as the correct answer. Format the output as follows: [Distractor 1] <Incorrect but plausible option> [Distractor 2] <Incorrect but plausible option> [Distractor 3] <Incorrect but plausible option> Provided Information: Background information: {segments info} Question: {question} Correct An...
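Two of the prompt fragments listed above describe rule-based checks that can be expressed deterministically: the four-stage order validation (entries [25] to [31]) and the compact segment-span selection (entries [38] to [45]). The sketch below is one possible reading of those rules, not the authors' implementation; in the paper these checks are delegated to an LLM. The data layout (`elements` mapping a name to its first segment ID and modality) and the 1-based segment numbering are assumptions.

```python
def verify_order(elements, proposed_order):
    """Apply the unique-segment, modality-diversity, and order checks.

    elements: {name: (first_segment_id, modality)}  # hypothetical layout
    proposed_order: list of element names as given in the answer
    """
    if not proposed_order:
        raise ValueError("proposed_order must not be empty")
    for name in proposed_order:
        if name not in elements:
            return "[NO]", f"unverifiable element: {name}"
    seen = {}
    for name in proposed_order:
        segment = elements[name][0]
        if segment in seen:
            return "[NO]", (f"co-occurring elements: {seen[segment]} & {name} "
                            f"in segment {segment}")
        seen[segment] = name
    modalities = {elements[name][1] for name in proposed_order}
    if len(modalities) < 2:
        return "[NO]", f"single modality: {next(iter(modalities))}"
    correct = sorted(proposed_order, key=lambda n: elements[n][0])
    if correct == list(proposed_order):
        return "[YES]", None
    return "[Corrected]", correct

def compact_span(required_segments, total_segments):
    """Smallest continuous span covering all required segments.

    Returns None when the span would run from the very first to the very
    last segment, which the prompt excludes. Assumes 1-based segment IDs.
    """
    start, end = min(required_segments), max(required_segments)
    if start == 1 and end == total_segments:
        return None
    return start, end

# Hypothetical example data.
elements = {
    "door slam": (4, "video caption"),
    "apology": (2, "subtitle"),
    "angry tone": (7, "speech emotion"),
}
verdict, detail = verify_order(elements, ["door slam", "apology", "angry tone"])
span = compact_span([2, 4, 7], total_segments=10)
```

Running deterministic checks like these alongside the LLM-based validation would give a cheap cross-check on the LLM's verdicts for cross-scene items.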