When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

Chen-An Li; Hung-yi Lee; Tzu-Han Lin

arxiv: 2510.00626 · v3 · submitted 2025-10-01 · 💻 cs.SD · cs.CL

Pith reviewed 2026-05-18 11:05 UTC · model grok-4.3

keywords: large audio-language models, irrelevant audio, text reasoning, model robustness, cross-modal interference, prediction volatility, silence impact

The pith

Even non-informative audio including silence reduces accuracy and raises volatility on text reasoning tasks in large audio-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines what happens when large audio-language models receive audio that carries no useful information for a text-only reasoning problem. It reports consistent drops in accuracy together with greater instability in the model outputs. The amount of interference grows when the audio lasts longer, plays louder, or when the model samples at higher temperatures. Silence turns out to be as disruptive as synthetic noise or environmental sounds. Larger models suffer less but still show the same pattern, and simple prompting fails to remove the effect while self-consistency helps at extra compute cost.
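
To make the temperature dependence concrete, here is a minimal sketch of temperature-scaled softmax, the standard way a decoding temperature reshapes a next-token distribution before sampling. The logits are made-up illustrative values, not numbers from the paper. The sketch only shows that a higher temperature flattens the distribution, which is one plausible reason small input perturbations show up more often in sampled answers.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return probabilities for `logits` after dividing each by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four answer options (illustrative values only).
logits = [2.0, 1.0, 0.5, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature most of the probability mass sits on the top option; at high temperature the alternatives gain mass, so repeated samples disagree more often.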

Core claim

Appending irrelevant audio to text inputs causes large audio-language models to produce less accurate and more variable answers on standard text reasoning benchmarks, with the degree of degradation increasing with audio duration, amplitude, and decoding temperature, and with silence proving comparably harmful to synthetic noise or environmental sounds.

What carries the argument

Cross-modal interference triggered by non-informative audio channels, quantified through accuracy loss and output volatility that scale with duration, amplitude, and sampling temperature.
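
The irrelevant-audio conditions can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the 16 kHz sample rate, the function names, and the noise amplitude of 0.1 are hypothetical, and the durations 1, 5, 10, and 30 are taken from the paper excerpt in the reference graph and assumed here to be seconds.

```python
import random

SAMPLE_RATE = 16_000  # assumed sample rate in Hz; the page does not state one

def make_silence(duration_s, sample_rate=SAMPLE_RATE):
    """All-zero waveform of the requested length."""
    return [0.0] * int(duration_s * sample_rate)

def make_gaussian_noise(duration_s, amplitude, sample_rate=SAMPLE_RATE, seed=0):
    """Zero-mean Gaussian noise with standard deviation `amplitude`, clipped to [-1, 1]."""
    rng = random.Random(seed)
    n = int(duration_s * sample_rate)
    return [max(-1.0, min(1.0, rng.gauss(0.0, amplitude))) for _ in range(n)]

def rms(waveform):
    """Root-mean-square level, a simple proxy for loudness."""
    if not waveform:
        return 0.0
    return (sum(x * x for x in waveform) / len(waveform)) ** 0.5

# Durations assumed to be in seconds (hypothetical reading of the excerpt).
for duration in (1, 5, 10, 30):
    silence = make_silence(duration)
    noise = make_gaussian_noise(duration, amplitude=0.1)
    print(duration, len(silence), round(rms(silence), 4), round(rms(noise), 4))
```

Sweeping `duration_s` and `amplitude` independently is what lets a study of this kind separate the effect of clip length from the effect of loudness.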

If this is right

Larger models display greater resistance to the interference but retain measurable vulnerabilities.
Prompt-based instructions produce only marginal improvement in stability.
Self-consistency decoding reduces volatility at the expense of higher inference cost.
The observed interference constitutes a general robustness limitation across the tested systems.
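
The self-consistency trade-off in the list above can be sketched as a majority vote over sampled answers. The sample answers below are hypothetical, and answer extraction from each generation is assumed to have happened already; the cost side of the trade-off is simply that every extra sample is one more full generation.

```python
from collections import Counter

def self_consistent_answer(samples):
    """Return the most frequent answer among sampled generations and its vote share."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Hypothetical final answers extracted from five sampled generations.
samples = ["42", "42", "41", "42", "40"]
answer, share = self_consistent_answer(samples)
print(answer, share)
```

A low vote share is itself a useful signal: it flags items where the added audio (or the temperature) has made the model's answer unstable.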

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures that more cleanly separate or gate modalities could reduce this form of cross-talk without extra decoding steps.
The same pattern may appear in other multimodal systems whenever an unneeded input channel is present.
Real-world recordings with natural background sound would provide a direct test of whether the controlled findings generalize.

Load-bearing premise

The chosen text benchmarks stay purely text-based reasoning problems even after audio is added, without any hidden signal or training artifact that would let the model treat the audio as relevant.

What would settle it

Measuring whether accuracy and volatility return to the no-audio baseline when the audio input is replaced by a zeroed waveform or entirely omitted while keeping every other model component fixed.

Original abstract

Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds that silence and irrelevant audio degrade text reasoning accuracy in LALMs roughly as much as noise does, with effects scaling by duration and temperature, but the controls for ruling out input artifacts look thin.

The central observation is that tacking on silence or synthetic noise to text-only prompts lowers accuracy on reasoning tasks and makes outputs more variable, with bigger drops for longer clips, louder audio, and higher sampling temperatures. Larger models handle it better but still show the issue. They also check a couple of mitigation ideas and find self-consistency helps more than extra prompting. That pattern is worth knowing for anyone deploying these models in real environments with background sound.

What stands out as new is the direct comparison showing silence performs about as poorly as noise, plus the scaling checks across duration, amplitude, and temperature on three separate benchmarks. The work is a clean empirical probe rather than a theoretical claim, and they do test a few practical fixes.

The main soft spot is the assumption that appending audio leaves the underlying task strictly text-based. If the fusion layers or prompt formatting let even zeroed audio shift attention or hidden states, the accuracy drop could come from that side effect instead of genuine cross-modal interference. The abstract gives no numbers on statistical significance, no error bars, and no explicit description of the text-only baseline or audio encoder state, so the evidence stays directional rather than conclusive.

This is the sort of paper that matters for people building or evaluating audio-language systems who care about robustness in noisy settings. It does not upend the field but it flags a deployment-relevant failure mode that deserves follow-up. I would send it for peer review because the question is practical and the reported pattern is plausible enough to warrant closer scrutiny of the methods and controls.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates the effects of appending irrelevant audio (silence, synthetic noise, environmental sounds) to text reasoning tasks in large audio-language models. Across three text-based benchmarks, it reports that non-informative audio reduces accuracy and increases prediction volatility, with interference severity scaling with longer durations, higher amplitudes, and elevated decoding temperatures. Larger models show greater resilience, but vulnerabilities persist; prompting offers limited mitigation while self-consistency improves stability at higher computational cost.

Significance. If the central empirical findings hold after addressing controls and statistical reporting, the work would be significant for highlighting cross-modal interference as a robustness issue in LALMs, challenging the neutrality of silence and non-informative inputs. It offers practical observations on scaling behaviors and mitigation trade-offs that could guide fusion mechanism design, though the current lack of error bars and baseline verification limits immediate impact.

major comments (2)

[Experimental Methodology] The description of audio input construction, audio encoder state for zeroed/silent inputs, prompt formatting, and explicit text-only baseline controls is insufficient to confirm that the three benchmarks remain strictly text-only reasoning tasks. Without these details, accuracy drops and volatility increases could arise from architectural side-effects, tokenization changes, or attention modulation rather than cross-modal interference (see skeptic concern on unintended signals).
[Results] The reported accuracy reductions and volatility increases lack error bars, statistical significance tests, confidence intervals, or details on data exclusion rules. This makes it difficult to evaluate the reliability of the directional effects and scaling claims across the three benchmarks and model sizes.
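
One way to supply the missing error bars is a percentile bootstrap over benchmark items. The sketch below is an illustration of that generic technique, not something the paper reports; the correctness flags are hypothetical and the function name and defaults are assumptions.

```python
import random

def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    `correct` is a list of 0/1 flags, one per benchmark item.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement and record the resampled accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lo, hi

# Hypothetical per-item correctness flags (not results from the paper).
flags = [1] * 70 + [0] * 30
acc, lo, hi = bootstrap_accuracy_ci(flags)
print(round(acc, 3), round(lo, 3), round(hi, 3))
```

Running this once per condition (no audio, silence, noise) would show whether the intervals overlap, which is the minimum needed to judge the directional claims.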

minor comments (2)

[Methods] Define 'prediction volatility' explicitly and describe its exact computation (e.g., variance over multiple runs or entropy).
[Experimental Setup] Expand the model and benchmark details (specific LALMs, exact datasets, audio generation parameters) to support reproducibility.
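
To illustrate what an explicit definition could look like, here are two candidate volatility measures. Both are assumptions made for illustration: this page does not state how the paper computes volatility, and the function names and sample answers are hypothetical.

```python
import math
from collections import Counter

def flip_rate(baseline_answers, perturbed_answers):
    """Fraction of items whose answer differs from the no-audio baseline."""
    if not baseline_answers:
        raise ValueError("need at least one item")
    if len(baseline_answers) != len(perturbed_answers):
        raise ValueError("answer lists must align item by item")
    changed = sum(b != p for b, p in zip(baseline_answers, perturbed_answers))
    return changed / len(baseline_answers)

def answer_entropy(answers):
    """Shannon entropy (bits) of the answers from repeated runs on one item."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical multiple-choice answers for four items.
baseline = ["A", "B", "C", "D"]
with_audio = ["A", "C", "C", "B"]
print(flip_rate(baseline, with_audio))        # 2 of 4 items changed
print(answer_entropy(["A", "A", "B", "B"]))   # two equally likely answers
```

The two measures answer different questions: the flip rate compares conditions across items, while the entropy captures run-to-run disagreement within a single condition.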

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and outline the revisions planned to improve clarity and rigor.

Referee: [Experimental Methodology] The description of audio input construction, audio encoder state for zeroed/silent inputs, prompt formatting, and explicit text-only baseline controls is insufficient to confirm that the three benchmarks remain strictly text-only reasoning tasks. Without these details, accuracy drops and volatility increases could arise from architectural side-effects, tokenization changes, or attention modulation rather than cross-modal interference (see skeptic concern on unintended signals).

Authors: We agree that additional methodological detail is required to rule out alternative explanations. In the revised manuscript we will expand the relevant section to describe audio input construction in full, specify the audio encoder state for silent and zeroed inputs, document prompt formatting exactly, and present explicit text-only baseline controls. These additions will confirm that the benchmarks were kept strictly text-based and that observed effects are attributable to cross-modal interference rather than tokenization or attention artifacts. revision: yes
Referee: [Results] The reported accuracy reductions and volatility increases lack error bars, statistical significance tests, confidence intervals, or details on data exclusion rules. This makes it difficult to evaluate the reliability of the directional effects and scaling claims across the three benchmarks and model sizes.

Authors: We acknowledge that the current results section would benefit from statistical reporting. In revision we will add error bars from repeated runs where available, report confidence intervals, conduct appropriate significance tests for accuracy and volatility differences, and specify data exclusion rules. These changes will strengthen evaluation of the directional and scaling effects across benchmarks and model sizes. revision: yes
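
As one example of an "appropriate significance test" for paired accuracy comparisons, the sketch below runs an exact two-sided McNemar test on per-item correctness with and without audio. The choice of test is an assumption for illustration, not something the rebuttal commits to, and the paired outcomes are hypothetical.

```python
from math import comb

def mcnemar_exact(baseline_correct, audio_correct):
    """Two-sided exact McNemar test on paired 0/1 correctness flags."""
    # b: correct without audio but wrong with audio; c: the reverse.
    b = sum(1 for x, y in zip(baseline_correct, audio_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, audio_correct) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial tail under the null that discordant pairs split 50/50.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical paired outcomes for 100 items (not results from the paper).
baseline = [1] * 12 + [0] * 3 + [1] * 85
audio = [0] * 12 + [1] * 3 + [1] * 85
b, c, p = mcnemar_exact(baseline, audio)
print(b, c, round(p, 4))
```

Because the same items are scored in both conditions, a paired test like this uses only the items whose outcome changed, which is generally more sensitive than comparing two independent accuracy figures.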

Circularity Check

0 steps flagged

No significant circularity in empirical measurement study

The paper is an empirical investigation that appends irrelevant audio to text benchmarks and measures resulting accuracy drops and volatility increases. No equations, derivations, fitted parameters, or self-citation chains are present in the provided text or abstract. Claims rest on direct experimental observations across models and conditions rather than any reduction of outputs to inputs by construction. This matches the default case of a self-contained empirical study against external benchmarks, warranting a score of 0 with no circular steps identified.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical robustness study. It introduces no free parameters, no new mathematical axioms, and no invented entities. It rests on the domain assumption that appended audio is processed by the model even when the task is defined as text-only.

axioms (1)

domain assumption: LALMs integrate audio input into their internal representations even for tasks where audio carries no task-relevant information.
This premise is required for the observed interference to be interpreted as cross-modal leakage rather than task misunderstanding.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
cs.SD 2026-04 unverdicted novelty 6.0

Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
cs.SD 2026-05 unverdicted novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 2 Pith papers · 11 internal anchors

[1] INTRODUCTION. Large audio-language models (LALMs) [1–7] have shown strong performance across a variety of multimodal tasks, showing the ability to process speech and text in a unified framework [8–16]. However, most evaluations assume clean, modality-aligned inputs. In practice, text reasoning often requires no audio, yet deployed systems still recei...
[2] INVESTIGATING CROSS-MODAL INTERFERENCE. 2.1. Problem Formulation. We analyze how large audio-language models (LALMs) handle tasks that rely only on text when the audio channel introduces irrelevant or distracting content. Figure 1 illustrates our problem setup, a text-only reasoning task with irrelevant audio signals such as silence, synthetic noise, or...
[3] ANALYSIS. 3.1. Scaling Interference Effects. Duration of Audio. Figure 3 illustrates the model’s performance and influence rate across different durations of irrelevant audio. The x-axis indicates the duration of added audio, ∅ represents the clean baseline without any interference, while the values 1, 5, 10, and 30 denote durations of silence and Gaussian nois...
[4] STRAIGHTFORWARD MITIGATION APPROACHES. 4.1. Methodology. We evaluate two straightforward mitigation approaches to investigate whether simple strategies can alleviate the impact of irrelevant audio. The first approach is adding a mitigation prompt. Specifically, we prepend a short instructional phrase, “Focus on the text or audio that contains useful in...
[5] CONCLUSION. Our study shows that irrelevant audio can interfere with how large audio-language models reason over text. Silence, noise, and environmental sounds disrupted performance, and the impact grew with longer duration, louder volume, and higher decoding temperatures. Even silence, often assumed neutral, proved disruptive, destabilizing outputs as m...
[6] OpenAI, “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024.
[7] Gheorghe Comanici et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025.
[8] Ke-Han Lu et al., “DeSTA2.5-Audio: Toward general-purpose large audio language model with self-generated cross-modal alignment,” arXiv preprint arXiv:2507.02768, 2025.
[9] Alexander H Liu et al., “Voxtral,” arXiv preprint arXiv:2507.13264, 2025.
[10] Abdelrahman Abouelenin et al., “Phi-4-Mini technical report: Compact yet powerful multimodal language models via Mixture-of-LoRAs,” arXiv preprint arXiv:2503.01743, 2025.
[11] Jin Xu et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025.
[12] Arushi Goel et al., “Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models,” arXiv preprint arXiv:2507.08128, 2025.
[13] Chien-yu Huang et al., “Dynamic-SUPERB phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in The Thirteenth International Conference on Learning Representations, 2025.
[14] Chih-Kai Yang et al., “SAKURA: On the multi-hop reasoning of large audio-language models based on speech and audio information,” in Interspeech 2025, 2025, pp. 1788–1792.
[15] S Sakshi et al., “MMAU: A massive multi-task audio understanding and reasoning benchmark,” in The Thirteenth International Conference on Learning Representations, 2025.
[16] Ziyang Ma et al., “MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,” arXiv preprint arXiv:2505.13032, 2025.
[17] Qian Yang et al., “AIR-bench: Benchmarking large audio-language models via generative comprehension,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.
[18] Bin Wang et al., “AudioBench: A universal benchmark for audio large language models,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025.
[19] Yu-Xiang Lin et al., “A preliminary exploration with GPT-4o voice mode,” arXiv preprint arXiv:2502.09940, 2025.
[20] Chih-Kai Yang, Neo S Ho, and Hung-yi Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,” arXiv preprint arXiv:2505.15957, 2025.
[21] Siddhant Arora et al., “On the landscape of spoken language models: A comprehensive survey,” arXiv preprint arXiv:2504.08528, 2025.
[22] Nicholas Carlini and David Wagner, “Audio adversarial examples: Targeted attacks on speech-to-text,” in 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 2018, pp. 1–7.
[23] Raghuveer Peri et al., “SpeechGuard: Exploring the adversarial robustness of multi-modal large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024.
[24] Guanyu Hou et al., “Evaluating robustness of large audio language models to audio injection: An empirical study,” arXiv preprint arXiv:2505.19598, 2025.
[25] Cheng Wang et al., “When audio and text disagree: Revealing text bias in large audio-language models,” arXiv preprint arXiv:2508.15407, 2025.
[26] Freda Shi et al., “Large language models can be easily distracted by irrelevant context,” in International Conference on Machine Learning. PMLR, 2023, pp. 31210–31227.
[27] Cheng-Han Chiang and Hung-yi Lee, “Over-reasoning and redundant calculation of large language models,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), 2024.
[28] Yanbo Wang et al., “Breaking focus: Contextual distraction curse in large language models,” in Will Synthetic Data Finally Solve the Data Access Problem?, 2025.
[29] Ming Liu et al., “On the robustness of multimodal language model towards distractions,” arXiv preprint arXiv:2502.09818, 2025.
[30] Ailin Deng et al., “Words or vision: Do vision-language models have blind faith in text?,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3867–3876.
[31] Rui Cai et al., “Diagnosing and mitigating modality interference in multimodal large language models,” arXiv preprint arXiv:2505.19616, 2025.
[32] Karl Cobbe et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
[33] Peter Clark et al., “Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge,” arXiv preprint arXiv:1803.05457, 2018.
[34] Dan Hendrycks et al., “Measuring massive multitask language understanding,” in International Conference on Learning Representations, 2021.
[35] Eduardo Fonseca et al., “FSD50K: An open dataset of human-labeled sound events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2021.
[36] Xuezhi Wang et al., “Self-consistency improves chain of thought reasoning in language models,” in The Eleventh International Conference on Learning Representations, 2023.
[37] Charlie Snell et al., “Scaling LLM test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv:2408.03314, 2024.
[38] Woosuk Kwon et al., “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[39] Thomas Wolf et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 2020.

[1] [1]

How- ever, most evaluations assume clean, modality-aligned inputs

INTRODUCTION Large audio-language models (LALMs) [1–7] have shown strong performance across a variety of multimodal tasks, showing the abil- ity to process speech and text in a unified framework [8–16]. How- ever, most evaluations assume clean, modality-aligned inputs. In practice, text reasoning often requires no audio, yet deployed sys- tems still recei...

work page

[2] [2]

When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

INVESTIGA TING CROSS-MODAL INTERFERENCE 2.1. Problem Formulation We analyze how large audio-language models (LALMs) handle tasks that rely only on text when the audio channel introduces irrelevant or distracting content. Figure 1 illustrates our problem setup, a text- only reasoning task with irrelevant audio signals such as silence, syn- thetic noise, or...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

silence/noise

ANALYSIS 3.1. Scaling Interference Effects Duration of AudioFigure 3 illustrates the model’s performance and influence rate across different durations of irrelevant audio. The x-axis indicates the duration of added audio,∅represents the clean baseline without any interference, while the values 1, 5, 10, and 30 denote durations of silence and Gaussian nois...

work page

[4] [4]

Focus on the text or audio that contains useful information

STRAIGHTFORW ARD MITIGA TION APPROACHES 4.1. Methodology We evaluate two straightforward mitigation approaches to investi- gate whether simple strategies can alleviate the impact of irrelevant audio. The first approach is adding a mitigation prompt. Specifi- cally, we prepend a short instructional phrase,“Focus on the text or audio that contains useful in...

work page

[5] [5]

Silence, noise, and envi- ronmental sounds disrupted performance, and the impact grew with longer duration, louder volume, and higher decoding temperatures

CONCLUSION Our study shows that irrelevant audio can interfere with how large audio-language models reason over text. Silence, noise, and envi- ronmental sounds disrupted performance, and the impact grew with longer duration, louder volume, and higher decoding temperatures. Even silence, often assumed neutral, proved disruptive, destabilizing outputs as m...

work page

[6] [6]

GPT-4o System Card

OpenAI (2024), “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici et al., “Gemini 2.5: Pushing the fron- tier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,”arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Desta2. 5-audio: Toward general-purpose large audio language model with self-generated cross-modal alignment,

Ke-Han Lu et al., “Desta2. 5-audio: Toward general-purpose large audio language model with self-generated cross-modal alignment,”arXiv preprint arXiv:2507.02768, 2025

work page arXiv 2025

[9] [9]

V oxtral.arXiv preprint arXiv:2507.13264, 2025

Alexander H Liu et al., “V oxtral,”arXiv preprint arXiv:2507.13264, 2025

work page arXiv 2025

[10] [10]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdelrahman Abouelenin et al., “Phi-4-mini technical re- port: Compact yet powerful multimodal language models via mixture-of-loras,”arXiv preprint arXiv:2503.01743, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Qwen2.5-Omni Technical Report

Jin Xu et al., “Qwen2. 5-omni technical report,”arXiv preprint arXiv:2503.20215, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Arushi Goel et al., “Audio flamingo 3: Advancing audio intel- ligence with fully open large audio language models,”arXiv preprint arXiv:2507.08128, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Dynamic-SUPERB phase-2: A collab- oratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,

Chien-yu Huang et al., “Dynamic-SUPERB phase-2: A collab- oratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” inThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[14] [14]

SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,

Chih-Kai Yang et al., “SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,” inInterspeech 2025, 2025, pp. 1788–1792

work page 2025

[15] [15]

MMAU: A massive multi-task audio under- standing and reasoning benchmark,

S Sakshi et al., “MMAU: A massive multi-task audio under- standing and reasoning benchmark,” inThe Thirteenth Inter- national Conference on Learning Representations, 2025

work page 2025

[16] [16]

Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,

Ziyang Ma et al., “Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,”arXiv preprint arXiv:2505.13032, 2025

work page arXiv 2025

[17] [17]

AIR-bench: Benchmarking large audio- language models via generative comprehension,

Qian Yang et al., “AIR-bench: Benchmarking large audio- language models via generative comprehension,” inProceed- ings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), 2024

work page 2024

[18] [18]

AudioBench: A universal benchmark for au- dio large language models,

Bin Wang et al., “AudioBench: A universal benchmark for au- dio large language models,” inProceedings of the 2025 Con- ference of the Nations of the Americas Chapter of the Associ- ation for Computational Linguistics: Human Language Tech- nologies (Volume 1: Long Papers), 2025

work page 2025

[19] [19]

A preliminary exploration with gpt-4o voice mode,

Yu-Xiang Lin et al., “A preliminary exploration with gpt-4o voice mode,”arXiv preprint arXiv:2502.09940, 2025

work page arXiv 2025

[20] [20]

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Chih-Kai Yang, Neo S Ho, and Hung-yi Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,”arXiv preprint arXiv:2505.15957, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] Siddhant Arora et al., “On the landscape of spoken language models: A comprehensive survey,” arXiv preprint arXiv:2504.08528, 2025

[22] Nicholas Carlini and David Wagner, “Audio adversarial examples: Targeted attacks on speech-to-text,” in 2018 IEEE security and privacy workshops (SPW). IEEE, 2018, pp. 1–7

[23] Raghuveer Peri et al., “SpeechGuard: Exploring the adversarial robustness of multi-modal large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024

[24] Guanyu Hou et al., “Evaluating robustness of large audio language models to audio injection: An empirical study,” arXiv preprint arXiv:2505.19598, 2025

[25] Cheng Wang et al., “When audio and text disagree: Revealing text bias in large audio-language models,” arXiv preprint arXiv:2508.15407, 2025

[26] Freda Shi et al., “Large language models can be easily distracted by irrelevant context,” in International Conference on Machine Learning. PMLR, 2023, pp. 31210–31227

[27] Cheng-Han Chiang and Hung-yi Lee, “Over-reasoning and redundant calculation of large language models,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), 2024

[28] Yanbo Wang et al., “Breaking focus: Contextual distraction curse in large language models,” in Will Synthetic Data Finally Solve the Data Access Problem?, 2025

[29] Ming Liu et al., “On the robustness of multimodal language model towards distractions,” arXiv preprint arXiv:2502.09818, 2025

[30] Ailin Deng et al., “Words or vision: Do vision-language models have blind faith in text?,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3867–3876

[31] Rui Cai et al., “Diagnosing and mitigating modality interference in multimodal large language models,” arXiv preprint arXiv:2505.19616, 2025

[32] Karl Cobbe et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021

[33] Peter Clark et al., “Think you have solved question answering? try arc, the ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018

[34] Dan Hendrycks et al., “Measuring massive multitask language understanding,” in International Conference on Learning Representations, 2021

[35] Eduardo Fonseca et al., “Fsd50k: an open dataset of human-labeled sound events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2021

[36] Xuezhi Wang et al., “Self-consistency improves chain of thought reasoning in language models,” in The Eleventh International Conference on Learning Representations, 2023

[37] Charlie Snell et al., “Scaling llm test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv:2408.03314, 2024

[38] Woosuk Kwon et al., “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[39] Thomas Wolf et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, Association for Computational Linguistics