pith. machine review for the scientific record.

arxiv: 2604.23860 · v1 · submitted 2026-04-26 · 💻 cs.CV · cs.AI

Recognition: unknown

Exploring Audio Hallucination in Egocentric Video Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords audio hallucination · egocentric video · audio-visual language models · multimodal evaluation · sound understanding · video understanding · hallucination taxonomy

The pith

State-of-the-art audio-visual language models frequently hallucinate sounds they cannot hear when processing egocentric videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that advanced AV-LLMs often generate descriptions of sounds inferred from visible cues rather than from actual audio input, particularly in first-person videos where camera motion makes visuals unreliable. This matters because sound provides essential information for recognizing user activities and surroundings in such settings. The authors introduce a question-answering evaluation using 300 curated egocentric videos and 1,000 sound-focused questions, together with a taxonomy that separates foreground action sounds from background ambient sounds. Their tests on models such as Qwen2.5 Omni yield accuracies of only 27.3 percent on foreground-sound questions and 39.5 percent on background-sound questions. The work therefore argues that building reliable AV-LLMs requires systematic measurement of audio hallucination.

Core claim

We show that advanced AV-LLMs such as Qwen2.5 Omni are prone to audio hallucinations in egocentric video understanding, often inferring sounds from visual cues that are visible but not heard. We present a systematic evaluation framework that uses a curated dataset of 300 videos, 1,000 sound-focused questions, and a grounded taxonomy distinguishing foreground action sounds from background ambient sounds. The evaluation reveals high hallucination rates, with accuracies of 27.3 percent on foreground-related questions and 39.5 percent on background-related questions. This demonstrates the need to measure the reliability of multimodal responses for developing dependable AV-LLMs.

What carries the argument

A targeted question-answering protocol applied to 300 egocentric videos and 1,000 sound-focused questions, supported by a taxonomy that separates foreground action sounds from background ambient sounds.
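
To make the shape of that protocol concrete, here is a minimal sketch of how per-category accuracy could be computed once each question is tagged with the paper's foreground/background taxonomy. The field names, the yes/no answer format, and the exact-match scoring are illustrative assumptions, not the authors' schema.

```python
# Minimal sketch of taxonomy-split Q/A scoring (illustrative; field names,
# answer format, and exact-match scoring are assumptions, not the paper's code).
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class SoundQA:
    video_id: str
    question: str
    category: str      # "foreground" or "background", per the paper's taxonomy
    ground_truth: str  # annotated answer, e.g. "yes" / "no"

def accuracy_by_category(qa_items, model_answers):
    """model_answers maps (video_id, question) -> the model's answer string."""
    correct, total = defaultdict(int), defaultdict(int)
    for qa in qa_items:
        pred = model_answers.get((qa.video_id, qa.question), "").strip().lower()
        total[qa.category] += 1
        correct[qa.category] += pred == qa.ground_truth.strip().lower()
    return {cat: correct[cat] / total[cat] for cat in total}
```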

If this is right

  • AV-LLMs cannot reliably ground audio descriptions to actual heard sounds in dynamic egocentric recordings.
  • General video-description benchmarks miss the specific failures that sound-focused questions expose.
  • Robust, hallucination-targeted evaluation is required to build reliable audio-visual language models.
  • Models must learn to withhold sound claims when no corresponding audio is present in the input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In wearable-camera or AR applications, audio hallucinations could produce misleading activity logs or environmental alerts.
  • Training regimes that penalize sound generation without matching audio input may reduce these errors across other unstable-camera domains.
  • Extending the same Q/A protocol to real-time streaming video would test whether hallucination rates increase under latency constraints.
  • The foreground-background split may generalize to other multimodal tasks where one modality is intermittently reliable.

Load-bearing premise

The 300-video dataset and 1,000 sound-focused questions accurately isolate true audio hallucinations from correct multimodal inferences or from biases in question design.

What would settle it

Human judges listening only to the audio tracks of the same 300 videos confirm that the sounds the models describe are absent from the audio, while the models answer the same questions correctly when the audio track is supplied without video.
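
A hedged sketch of that modality-ablation check: run the same sound-focused questions with the full clip, with the audio track alone, and with the muted video, then compare per-category accuracy. The ask_model wrapper and the condition names are assumptions introduced here for illustration; the paper does not describe such an interface.

```python
# Sketch of the audio-only / video-only ablation described above.
# ask_model(video_id, question, modality=...) is a hypothetical wrapper around
# an AV-LLM; accuracy_by_category is the scoring function sketched earlier.
def ablation_table(qa_items, ask_model, accuracy_by_category):
    results = {}
    for condition in ("audio+video", "audio-only", "video-only"):
        answers = {
            (qa.video_id, qa.question): ask_model(qa.video_id, qa.question,
                                                  modality=condition)
            for qa in qa_items
        }
        results[condition] = accuracy_by_category(qa_items, answers)
    # A hallucination signature: video-only accuracy tracks audio+video accuracy
    # while audio-only accuracy diverges, i.e. answers are driven by vision.
    return results
```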

Original abstract

Egocentric videos provide a distinctive setting in which sound serves as crucial cues to understand user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds from the user activities and background ambient sounds. Our evaluation shows that advanced AV-LLMs, such as Qwen2.5 Omni, exhibit high hallucination rates, achieving only 27.3% and 39.5% accuracy on Q/As related to foreground and background sounds, respectively. With this work, we highlight the need to measure the reliability of multimodal responses, emphasizing that robust evaluation of hallucinations is essential to develop reliable AV-LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a systematic evaluation framework for audio hallucinations in audio-visual large language models (AV-LLMs) on egocentric videos. It curates a dataset of 300 videos with 1,000 sound-focused questions, proposes a grounded taxonomy distinguishing foreground action sounds from user activities and background ambient sounds, and reports that state-of-the-art models such as Qwen2.5 Omni achieve only 27.3% accuracy on foreground-sound questions and 39.5% on background-sound questions, indicating high rates of audio hallucination.

Significance. If the 300-video dataset and 1,000 questions are shown to isolate audio hallucinations without visual leakage or annotation ambiguity, the work would be significant for exposing reliability gaps in multimodal models for egocentric understanding, where sound cues are critical due to camera motion and occlusion. It could motivate improved training objectives and evaluation protocols for AV-LLMs.

major comments (2)
  1. [Abstract / Dataset curation] Abstract and dataset curation section: The headline accuracies (27.3% foreground, 39.5% background) are the sole quantitative support for the central claim, yet the manuscript supplies no information on how ground-truth labels for the 1,000 Q/As were established (e.g., audio-only verification of sound presence/absence, inter-annotator agreement, or explicit controls preventing questions from being answerable from vision alone). Without these, the reported hallucination rates cannot be interpreted as isolating true audio hallucinations.
  2. [Evaluation framework] Evaluation protocol: No details are given on question design validation, video selection criteria, or error bars around the accuracy figures. This directly affects whether the low accuracies reflect model hallucination rather than ambiguous or visually inferable cases, undermining the claim that AV-LLMs 'often infer sounds from visual cues that are visible but not heard.'
minor comments (2)
  1. [Taxonomy] The taxonomy is described as 'grounded' but the manuscript does not illustrate it with concrete examples of foreground vs. background questions or model outputs.
  2. [Abstract] The abstract mentions an 'automatic evaluation framework' but does not clarify whether the Q/A scoring is fully automatic or involves human judgment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas where additional transparency is needed to strengthen the interpretation of our results on audio hallucinations in AV-LLMs. We will revise the manuscript to incorporate the requested details on dataset curation and evaluation protocols. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: [Abstract / Dataset curation] Abstract and dataset curation section: The headline accuracies (27.3% foreground, 39.5% background) are the sole quantitative support for the central claim, yet the manuscript supplies no information on how ground-truth labels for the 1,000 Q/As were established (e.g., audio-only verification of sound presence/absence, inter-annotator agreement, or explicit controls preventing questions from being answerable from vision alone). Without these, the reported hallucination rates cannot be interpreted as isolating true audio hallucinations.

    Authors: We agree that the current manuscript would be strengthened by explicit details on ground-truth establishment. In the revised version, we will expand the Dataset Curation section to describe the annotation process: annotators first reviewed audio tracks independently to identify present foreground action sounds and background ambient sounds, then formulated targeted questions aligned with our taxonomy. We will also document controls for visual leakage, such as prioritizing egocentric scenarios with motion-induced occlusion or ambiguity where visual cues alone are insufficient to answer the questions. Inter-annotator agreement will be reported for the labels to demonstrate reliability. revision: yes

  2. Referee: [Evaluation framework] Evaluation protocol: No details are given on question design validation, video selection criteria, or error bars around the accuracy figures. This directly affects whether the low accuracies reflect model hallucination rather than ambiguous or visually inferable cases, undermining the claim that AV-LLMs 'often infer sounds from visual cues that are visible but not heard.'

    Authors: We thank the referee for highlighting these omissions. In the revision, we will add a dedicated subsection on the evaluation protocol. This will specify video selection criteria (e.g., clips drawn from egocentric datasets filtered for audio event presence and visual challenges due to camera motion), question design validation (including alignment with the foreground/background taxonomy and human pilot checks to confirm audio dependence), and clarification on error bars. As the evaluation uses a fixed set of questions and deterministic model inference, statistical variance was not computed in the original submission; however, we will include cross-model consistency analysis and, where appropriate, prompt-variation results to provide robustness estimates. revision: yes
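
For reference, the two robustness numbers promised in the rebuttal have standard forms: Cohen's kappa for inter-annotator agreement on the sound labels, and a percentile bootstrap interval around each reported accuracy. The sketch below illustrates those statistics under that assumption; it is not the authors' analysis code.

```python
# Illustrative statistics only, not the authors' code: Cohen's kappa over two
# annotators' sound labels, and a percentile bootstrap CI for an accuracy.
import random
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def bootstrap_accuracy_ci(is_correct, iters=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for per-question correctness."""
    n = len(is_correct)
    samples = sorted(sum(random.choices(is_correct, k=n)) / n for _ in range(iters))
    return samples[int(alpha / 2 * iters)], samples[int((1 - alpha / 2) * iters) - 1]
```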

Circularity Check

0 steps flagged

No significant circularity: pure empirical measurement study

Full rationale

The paper curates 300 egocentric videos and 1,000 questions, applies a taxonomy distinguishing foreground/background sounds, and reports direct accuracy measurements (27.3% and 39.5%) on AV-LLM outputs. No equations, derivations, fitted parameters, or predictions exist that could reduce to inputs by construction. No self-citations support uniqueness theorems, ansatzes, or load-bearing premises. The evaluation protocol is self-contained as a benchmark measurement; reported numbers are observed outcomes against the authors' annotations rather than any re-derived or fitted quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the authors' question set and video curation faithfully measure hallucinations without bias. No free parameters, mathematical axioms, or invented entities are introduced; the work is purely empirical.

pith-pipeline@v0.9.0 · 5555 in / 1185 out tokens · 39581 ms · 2026-05-08T06:34:38.691167+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 10 canonical work pages · 8 internal anchors

  1. [1]

    Exploring Audio Hallucination in Egocentric Video Understanding

    INTRODUCTION Audio–visual language models (AV-LLMs) and multimodal large language models (MLLMs) have demonstrated impressive abilities to generate rich descriptions from both videos and audios, supporting tasks such as event understanding, action recognition, and sound description [1, 2]. Yet, despite their multimodal design, these models often hall...

  2. [2]

    RELATED WORK Audio-Visual Language Models (AV-LLMs). Recent advances in AV-LLMs and multimodal large language models (MLLMs) have enabled rich multimodal understanding, including video captioning, sound description, and action recognition [1, 2]. These models integrate visual and auditory modalities to generate coherent descriptions; however, they...

  3. [3]

    mechanical sound

    METHODOLOGY 3.1. Dataset Curation To systematically probe the influence of visual context on audio perception in egocentric videos, we introduce a curated dataset derived from 300 clips of the Ego4D corpus [5], encompassing 50 distinct visual scenarios. As shown in Fig 2, our data curation pipeline starts with the segmentation of the source audio into ...

  4. [4]

    Our evaluation includes ImageBind-LLM [11], VideoLLaMA2 [8], Qwen2.5-Omni [9] with 7B active parameters, and MiniCPM [12] with 8B active parameters

    EXPERIMENTAL SETUP Baseline Details. We evaluate four state-of-the-art AV-LLMs that jointly process audio and visual information, encompassing diverse architectures and training paradigms. Our evaluation includes ImageBind-LLM [11], VideoLLaMA2 [8], Qwen2.5-Omni [9] with 7B active parameters, and MiniCPM [12] with 8B active parameters. Notably, while I...

  5. [5]

    the person closing the drawers,

    RESULTS 5.1. Quantitative Results Table 1 presents the performance of four AV-LLMs on factual (Fact.) and hallucinatory (Hal.) question answering (Q/A) across both foreground and background sounds in egocentric videos. While most AV-LLMs perform close to random guessing, SOTA models such as Qwen2.5 Omni reach 56.2% on foreground factual and 63.4% on b...

  6. [6]

    CONCLUSION We presented the first taxonomy-driven evaluation of audio hallucinations in egocentric videos through a systematic Q/A benchmark. By curating 300 clips and designing 1,000 sound-focused Q/A pairs, we showed that state-of-the-art AV-LLMs often default to visual biases when answering questions about auditory information. Our analysis identifie...

  7. [7]

    A survey on multimodal large language models,

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen, “A survey on multimodal large language models,” National Science Review, vol. 11, no. 12, pp. nwae403, 2024

  8. [8]

    Deep audio-visual learning: A survey,

    Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He, “Deep audio-visual learning: A survey,” International Journal of Automation and Computing, vol. 18, no. 3, pp. 351–376, 2021

  9. [9]

    Hallucination of Multimodal Large Language Models: A Survey

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou, “Hallucination of multimodal large language models: A survey,” arXiv preprint arXiv:2404.18930, 2024

  10. [10]

    Egoillusion: Benchmarking hallucinations in egocentric video understanding,

    Ashish Seth, Utkarsh Tyagi, Ramaneswaran Selvakumar, Nishit Anand, Sonal Kumar, Sreyan Ghosh, Ramani Duraiswami, Chirag Agarwal, and Dinesh Manocha, “Egoillusion: Benchmarking hallucinations in egocentric video understanding,” arXiv preprint arXiv:2508.12687, 2025

  11. [11]

    Ego4d: Around the world in 3,000 hours of egocentric video,

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al., “Ego4d: Around the world in 3,000 hours of egocentric video,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18995–19012

  12. [12]

    Scaling egocentric vision: The epic-kitchens dataset,

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al., “Scaling egocentric vision: The epic-kitchens dataset,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 720–736

  13. [13]

    Action2sound: Ambient-aware generation of action sounds from egocentric videos,

    Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, and Kristen Grauman, “Action2sound: Ambient-aware generation of action sounds from egocentric videos,” in European Conference on Computer Vision. Springer, 2024, pp. 277–295

  14. [14]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al., “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,” arXiv preprint arXiv:2406.07476, 2024

  15. [15]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  16. [16]

    The llama 3 herd of models,

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al., “The llama 3 herd of models,” arXiv e-prints, pp. arXiv–2407, 2024

  17. [17]

    ImageBind-LLM: Multi-Modality Instruction Tuning

    Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al., “Imagebind-llm: Multi-modality instruction tuning,” arXiv preprint arXiv:2309.03905, 2023

  18. [18]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al., “Minicpm: Unveiling the potential of small language models with scalable training strategies,” arXiv preprint arXiv:2404.06395, 2024

  19. [19]

    Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos,

    Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, and Federico Tombari, “Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24129–24138

  20. [20]

    NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping, “Nv-embed: Improved techniques for training llms as generalist embedding models,” arXiv preprint arXiv:2405.17428, 2024

  21. [21]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al., “A survey on llm-as-a-judge,” arXiv preprint arXiv:2411.15594, 2024

  22. [22]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024